Abstract. In this paper we present static and dynamic studies of duplicate and near-duplicate documents in the Web. The static and dynamic studies involve the analysis of similar c...
Abstract. The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed size “sketch...
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching sig...
Martin Theobald, Jonathan Siddharth, Andreas Paepc...
Automatic hypertext classification is an essential technique for organizing vast amount of Internet Web pages or HTML documents. One the of problems in classifying Web pages is tha...
Discovering related concepts in a multi-agent system among agents with diverse ontologies is difficult using existing knowledge representation languages and approaches. We describ...