Sciweavers

139 search results - page 2 / 28
» An Approach to Identify Duplicated Web Pages
KDD
2006
ACM
14 years 5 months ago
Understanding Content Reuse on the Web: Static and Dynamic Analyses
Abstract. In this paper we present static and dynamic studies of duplicate and near-duplicate documents in the Web. The static and dynamic studies involve the analysis of similar c...
Ricardo A. Baeza-Yates, Álvaro R. Pereira J...
CPM
2000
Springer
13 years 9 months ago
Identifying and Filtering Near-Duplicate Documents
Abstract. The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed size “sketch...
Andrei Z. Broder
SIGIR
2008
ACM
13 years 5 months ago
SpotSigs: robust and efficient near duplicate detection in large web collections
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching sig...
Martin Theobald, Jonathan Siddharth, Andreas Paepc...
IEAAIE
2003
Springer
13 years 10 months ago
Applying Semantic Links for Classifying Web Pages
Automatic hypertext classification is an essential technique for organizing vast amounts of Internet Web pages or HTML documents. One of the problems in classifying Web pages is tha...
Ben Choi, Qing Guo
ECAI
2000
Springer
13 years 9 months ago
An Instance-based Approach for Identifying Candidate Ontology Relations within a Multi-Agent System
Discovering related concepts in a multi-agent system among agents with diverse ontologies is difficult using existing knowledge representation languages and approaches. We describ...
Andrew B. Williams, Costas Tsatsoulis