CIKM
2008
Springer

A densitometric approach to web page segmentation

Web page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segmenting HTML pages, building on methods from Quantitative Linguistics and strategies borrowed from the area of Computer Vision. We utilize the notion of text density as a measure to identify the individual text segments of a web page, reducing the problem to solving a 1D-partitioning task. The distribution of segment-level text density seems to follow a negative hypergeometric distribution, described by Frumkina's Law. Our extensive evaluation confirms the validity and quality of our approach and its applicability to the Web.
Categories and Subject Descriptors: H.3.3 [Information Systems]: Information Search and Retrieval
General Terms: Algorithms, Experimentation
Keywords: Web Page Segmentation, Full-text Extraction, Template Detection, Noise Removal
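To make the idea of density-based 1D partitioning concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the density formula, so the sketch assumes density means tokens per line after wrapping a block's text at a fixed width, and it assumes adjacent blocks are merged when their relative density difference is small. The names `text_density` and `fuse_blocks`, the wrap width of 80, and the threshold of 0.5 are all hypothetical choices made for this example.

```python
import textwrap


def text_density(block: str, width: int = 80) -> float:
    """Tokens per line after wrapping the block's text at `width` columns.

    Assumption for this sketch: when a block wraps to several lines, the
    last (possibly partial) line is ignored so it does not skew the ratio.
    """
    lines = textwrap.wrap(block, width=width)
    if not lines:
        return 0.0
    if len(lines) > 1:
        lines = lines[:-1]
    tokens = sum(len(line.split()) for line in lines)
    return tokens / len(lines)


def fuse_blocks(blocks, threshold: float = 0.5, width: int = 80):
    """Single left-to-right pass over a 1D sequence of text blocks.

    A block is merged into the previous segment when the relative
    difference of their densities is at most `threshold` (a hypothetical
    parameter); otherwise it starts a new segment.
    """
    segments = []
    for block in blocks:
        if not block.strip():
            continue  # skip empty blocks
        if segments:
            prev = segments[-1]
            d_prev = text_density(prev, width)
            d_cur = text_density(block, width)
            highest = max(d_prev, d_cur)
            if highest > 0 and abs(d_prev - d_cur) / highest <= threshold:
                segments[-1] = prev + " " + block
                continue
        segments.append(block)
    return segments


if __name__ == "__main__":
    # Short navigation-like blocks followed by two long paragraphs.
    page = [
        "Home",
        "About",
        "Contact",
        " ".join(["word"] * 60),
        " ".join(["token"] * 50),
    ]
    for segment in fuse_blocks(page):
        print(round(text_density(segment), 2), segment[:40])
```

In this sketch, sparse blocks such as navigation links end up in a different segment from dense running text, because their densities differ sharply. The merge rule is greedy, so the result depends on block order and on the chosen threshold.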
Christian Kohlschütter, Wolfgang Nejdl
Added 12 Oct 2010
Updated 12 Oct 2010
Type Conference
Year 2008
Where CIKM
Authors Christian Kohlschütter, Wolfgang Nejdl