Do not crawl in the DUST: different URLs with similar text

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a small number of actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.

Categories and Subject Descriptors: H.3.3 Information Search and Retrieval.
General Terms: Algorithms.
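To make the notion of a DUST rule concrete, here is a minimal sketch in Python of the rewriting step a crawler could apply once such rules are known. It is not the DustBuster mining algorithm itself; the substring substitution rule format follows the abstract's description, but the specific rules and domain names below are hypothetical examples.

```python
# Hypothetical substring substitution rules of the kind DustBuster mines
# from crawl or web server logs. Each rule (alpha, beta) says: replacing
# alpha with beta in a URL likely yields a URL with similar content.
RULES = [
    ("/story_", "/story?id="),   # alias form: story_1234 -> story?id=1234
    ("/index.html", "/"),        # default index page is redundant
]

def apply_rules(url: str) -> str:
    """Rewrite a URL toward a canonical form by applying each matching rule once."""
    for alpha, beta in RULES:
        if alpha in url:
            url = url.replace(alpha, beta, 1)
    return url

if __name__ == "__main__":
    for u in ("http://example.com/story_1234",
              "http://example.com/news/index.html"):
        print(u, "->", apply_rules(u))
```

Canonicalizing URLs this way lets a crawler skip duplicate fetches and an indexer merge popularity signals (e.g., for PageRank) across aliases, without ever comparing page contents.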
Type: Conference
Year: 2006
Where: WWW
Authors: Uri Schonfeld, Ziv Bar-Yossef, Idit Keidar