Towards a Content-Provider-Friendly Web Page Crawler



Search engine quality is impacted by two factors: the quality of the ranking/matching algorithm used and the freshness of the search engine’s index, which maintains a “snapshot” of the Web. Web crawlers capture web pages and refresh the index, but this is a never-ending quest, as web pages get updated frequently (and thus have to be re-crawled). Knowing when to re-crawl a web page is fundamentally linked to the freshness of the index, given the size of the Web today and the inherent resource constraints: re-crawling too frequently wastes bandwidth, while re-crawling too infrequently degrades the quality of the search engine. In this work, we address the scheduling problem for web crawlers, with the objective of optimizing the quality of the index (i.e., maximizing the freshness probability of the local repository as well as of the index). Towards this, we utilize feedback from the users (content providers) on when their web pages are updated and consider the entire spe...
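
The trade-off described in the abstract (a limited crawl budget versus index freshness, informed by provider-reported update times) can be illustrated with a small sketch. This is not the paper's algorithm: it is a simple greedy heuristic chosen for illustration, and the names `Page`, `is_fresh`, `freshness`, `schedule`, and `crawl` are hypothetical. It assumes every content provider reports the time of its latest update and that each crawl costs one unit of budget.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    last_crawled: float  # time of our most recent crawl of this page
    last_updated: float  # provider-reported time of the latest update (assumed available)


def is_fresh(page):
    """A local copy is fresh if it was crawled at or after the latest update."""
    return page.last_crawled >= page.last_updated


def freshness(pages):
    """Fraction of pages whose local copy is up to date."""
    if not pages:
        return 1.0
    return sum(is_fresh(p) for p in pages) / len(pages)


def schedule(pages, budget):
    """Pick at most `budget` stale pages, oldest reported update first.

    Pages that are already fresh are never selected, so no bandwidth
    is spent re-crawling unchanged content.
    """
    stale = [p for p in pages if not is_fresh(p)]
    stale.sort(key=lambda p: p.last_updated)
    return stale[:budget]


def crawl(pages, budget, now):
    """Simulate one crawl round at time `now` and return the pages crawled."""
    chosen = schedule(pages, budget)
    for p in chosen:
        p.last_crawled = now
    return chosen
```

With a budget smaller than the number of stale pages, some pages stay stale after a round, which is the resource constraint the abstract refers to. Ordering by oldest reported update is only one possible priority; the truncated abstract does not say which policy the authors use.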

Jie Xu, Qinglan Li, Huiming Qu, Alexandros Labrinidis


Internet Technology | Search Engine | Search Engine Index | Web Pages | WEBDB 2007 |


» Effective web-scale crawling through website analysis

» Category ranking for personalized search

» A clustering-based sampling approach for refreshing search engine's database

» To search or to crawl? Towards a query optimizer for text-centric tasks

Post Info

Added	09 Jun 2010
Updated	09 Jun 2010
Type	Conference
Year	2007
Where	WEBDB
Authors	Jie Xu, Qinglan Li, Huiming Qu, Alexandros Labrinidis


Sciweavers
