Sciweavers

WWW
2006
ACM
Do not crawl in the DUST: different URLs with similar text
We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and...
Uri Schonfeld, Ziv Bar-Yossef, Idit Keidar
KDD
2008
ACM
De-duping URLs via rewrite rules
A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal...
Anirban Dasgupta, Ravi Kumar, Amit Sasturkar