The prevailing model for digital preservation is that archives should be similar to a “fortress”: a large, protective infrastructure built to defend a relatively small collect...
Web sites that accept and display content such as wiki articles or comments typically filter the content to prevent injected script code from running in browsers that view the sit...
In this paper, we describe CALM, a method for building statistical language models for the Web. CALM addresses several unique challenges dealing with the Web contents. First, CALM...
Templates in web sites hurt search engine retrieval performance, especially in content relevance and link analysis. Current template removal methods suffer from processing speed ...
We propose new features and algorithms for automating Web-page classification tasks such as content recommendation and ad blocking. We show that the automated classification of We...