Sciweavers

» Detecting near-duplicates for web crawling
SIGIR
2008
ACM
13 years 4 months ago
SpotSigs: robust and efficient near duplicate detection in large web collections
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching sig...
Martin Theobald, Jonathan Siddharth, Andreas Paepc...
LAWEB
2003
IEEE
13 years 9 months ago
On the Evolution of Clusters of Near-Duplicate Web Pages
This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis ove...
Dennis Fetterly, Mark Manasse, Marc Najork
WWW
2008
ACM
14 years 5 months ago
Detecting image spam using visual features and near duplicate detection
Email spam is a much studied topic, but even though current email spam detecting software has been gaining a competitive edge against text based email spam, new advances in spam g...
Bhaskar Mehta, Saurabh Nangia, Manish Gupta 0002, ...
ICMCS
2007
IEEE
13 years 10 months ago
SICO: A System for Detection of Near-Duplicate Images During Search
Duplicate and near-duplicate digital image matching is beneficial for image search in terms of collection management, digital content protection, and search efficiency. In this ...
Jun Jie Foo, Ranjan Sinha, Justin Zobel
WWW
2008
ACM
14 years 5 months ago
Efficient similarity joins for near duplicate detection
With the increasing amount of data and the need to integrate data from multiple data sources, a challenging issue is to find near duplicate records efficiently. In this paper, we ...
Chuan Xiao, Wei Wang 0011, Xuemin Lin, Jeffrey Xu ...