xCrawl: A High-Recall Crawling Method for Web Mining

Web Mining Systems exploit the redundancy of data published on the Web to automatically extract information from existing web documents. The first step in the Information Extraction process is thus to locate within a limited period of time as many web pages as possible that contain relevant information, a task which is commonly accomplished by applying focused crawling techniques. The performance of such a crawler can be measured by its “recall”, i.e. the percentage of documents found and identified as relevant compared to the number of existing documents. A higher recall value implies that more redundant data is available, which in turn leads to better results in the subsequent fact extraction phase. In this paper, we propose XCRAWL, a new focused crawling method which outperforms state-of-the-art approaches with respect to recall values achievable within a given period of time. This method is based on a new combination of ideas and techniques used to identify and exploit navig...
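
The recall measure described in the abstract can be illustrated with a minimal sketch. It assumes that "existing documents" means the existing relevant documents and that documents are identified by URL; the function name `recall` and the example URLs are hypothetical and are not taken from the paper.

```python
def recall(found_relevant: set[str], all_relevant: set[str]) -> float:
    """Fraction of the existing relevant documents that the crawler found.

    Assumption: both arguments are sets of document URLs, and all_relevant
    is the (normally unknown) ground-truth set of relevant documents.
    """
    if not all_relevant:
        raise ValueError("all_relevant must not be empty")
    # Only documents that are truly relevant count as hits.
    return len(found_relevant & all_relevant) / len(all_relevant)


# Hypothetical URLs, for illustration only.
existing = {"a.example/1", "a.example/2", "b.example/1", "b.example/2"}
found = {"a.example/1", "b.example/2", "c.example/9"}

print(recall(found, existing))  # 0.5
```

Here two of the four existing relevant documents were found, so the recall is 0.5; the page at `c.example/9` does not count because it is not in the ground-truth set.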

Kostyantyn M. Shchekotykhin, Dietmar Jannach, Gerhard Friedrich

Crawling Techniques | Data Mining | Focused Crawling | ICDM 2008 | Web Mining System |

» Mining Anchor Text Trends for Retrieval

» Mining User Comment Activity for Detecting Forum Spammers in YouTube

» Watermarking the Outputs of Structured Prediction with an application in Statistical Machi...

» Geographic web usage estimation by monitoring DNS caches

» Classifying web sites

» Estimating the global pagerank of web communities

» Mining templates from search result records of search engines

» Identifying comparable entities on the web

Post Info

Added	30 May 2010
Updated	30 May 2010
Type	Conference
Year	2008
Where	ICDM
Authors	Kostyantyn M. Shchekotykhin, Dietmar Jannach, Gerhard Friedrich
