Recrawl scheduling based on information longevity

It is crucial for a web crawler to distinguish between ephemeral and persistent content. Ephemeral content (e.g., quote of the day) is usually not worth crawling, because by the time it reaches the index it is no longer representative of the web page from which it was acquired. On the other hand, content that persists across multiple page updates (e.g., recent blog postings) may be worth acquiring, because it matches the page's true content for a sustained period of time. In this paper we characterize the longevity of information found on the web, via both empirical measurements and a generative model that coincides with these measurements. We then develop new recrawl scheduling policies that take longevity into account. As we show via experiments over real web data, our policies obtain better freshness at lower cost, compared with previous approaches.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algo...
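The distinction the abstract draws between ephemeral and persistent content can be illustrated with a small sketch. This is not the paper's measurement method or scheduling policy; it only assumes that a page is available as a list of text snapshots taken at successive crawls, and that content is compared as k-word shingles. The names `fragments` and `persistent_fraction`, the shingle size `k`, and the threshold `min_snapshots` are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import List, Set


def fragments(text: str, k: int = 3) -> Set[str]:
    """Split text into a set of k-word shingles (an assumed fragment unit)."""
    words = text.split()
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def persistent_fraction(snapshots: List[str], min_snapshots: int = 2, k: int = 3) -> float:
    """Fraction of distinct fragments seen in at least `min_snapshots` snapshots.

    A value near 0 suggests mostly ephemeral content (it rarely survives an
    update); a value near 1 suggests content that persists across updates.
    """
    counts: Counter = Counter()
    for snap in snapshots:
        # Each snapshot contributes at most 1 to a fragment's count.
        counts.update(fragments(snap, k))
    if not counts:
        return 0.0
    persistent = sum(1 for c in counts.values() if c >= min_snapshots)
    return persistent / len(counts)


# Hypothetical snapshots: a "quote of the day" page versus a stable page.
quote_page = [
    "quote one is here",
    "other words entirely now",
    "third distinct phrase appears",
]
stable_page = ["alpha beta gamma delta"] * 3

quote_score = persistent_fraction(quote_page)
stable_score = persistent_fraction(stable_page)
```

Under these assumptions, a crawler could treat a low score as a hint that recrawling the page buys little lasting freshness, and a high score as a hint that newly acquired content will stay representative for a while.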

Christopher Olston, Sandeep Pandey

Ephemeral Content | Internet Technology | Multiple Page Updates | Retrieval General Terms | WWW 2008 |

Post Info

Added	21 Nov 2009
Updated	21 Nov 2009
Type	Conference
Year	2008
Where	WWW
Authors	Christopher Olston, Sandeep Pandey
