
WIDM 2004 (ACM)

Probabilistic models for focused web crawling

A focused crawler must use information gleaned from previously crawled page sequences to estimate the relevance of a newly seen URL. Good performance therefore depends on powerful modelling of context as well as of the current observations. Probabilistic models, such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs), can potentially capture both formatting and context. In this paper, we present the use of HMMs for focused web crawling and compare them with the Best-First strategy. Furthermore, we discuss the use of CRFs to overcome the difficulties with HMMs and to support many arbitrary, overlapping features. Finally, we describe the design of a system applying CRFs to focused web crawling, which is currently being implemented. Categories and Subject Descriptors: H.5.4 [Information interfaces and presentation]: Hypertext/hypermedia; I.5.4 [Pattern recognition]: Applications, Text processing; I.2.6 [Artificial intelligence]: Learning; I.2.8 [Artificial int...
Hongyu Liu, Evangelos E. Milios, Jeannette Janssen
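For readers unfamiliar with the Best-First baseline the abstract compares against, the sketch below is a minimal illustration of a best-first focused crawler: candidate URLs wait in a priority queue ordered by an estimated relevance score, and the crawler always expands the most promising one next. The helpers fetch_page, extract_links, and relevance_score are hypothetical placeholders, not the authors' code; in the paper's setting the score would come from a probabilistic model such as an HMM conditioned on the sequence of previously crawled pages, rather than a simple heuristic.

import heapq


def best_first_crawl(seed_urls, relevance_score, fetch_page, extract_links,
                     max_pages=1000):
    """Best-First focused crawling: always expand the most promising URL.

    relevance_score(url, context) -> float in [0, 1]. Here "context" is just
    the text of the parent page; a probabilistic model (e.g. an HMM) would
    instead condition on the whole sequence of pages leading to the URL.
    fetch_page and extract_links are assumed helpers supplied by the caller.
    """
    # Max-heap via negated priorities; seeds start with the top score.
    frontier = [(-1.0, url, "") for url in seed_urls]
    heapq.heapify(frontier)
    visited = set()
    crawled = []

    while frontier and len(crawled) < max_pages:
        neg_score, url, context = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)

        page_text = fetch_page(url)            # assumed: returns page text
        crawled.append((url, -neg_score))

        # Score every outgoing link and push it onto the frontier, carrying
        # the current page's text as the context for the next estimate.
        for out_url in extract_links(url, page_text):
            if out_url not in visited:
                score = relevance_score(out_url, page_text)
                heapq.heappush(frontier, (-score, out_url, page_text))

    return crawled

The frontier structure itself is common to the strategies discussed; only the relevance estimator plugged into it changes.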
Added: 30 Jun 2010
Updated: 30 Jun 2010
Type: Conference
Year: 2004
Where: WIDM
Authors: Hongyu Liu, Evangelos E. Milios, Jeannette Janssen