CETR: content extraction via tag ratios

We present Content Extraction via Tag Ratios (CETR) – a method to extract content text from diverse webpages by using the HTML document’s tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: [In...
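As a rough illustration of the line-by-line tag ratio idea in the abstract, the sketch below computes one ratio per line of an HTML string. It assumes the ratio is the number of non-tag characters divided by the number of tags on that line, with a tag count of zero treated as one; the abstract does not spell out the definition, so treat this as an assumption. The names `tag_ratios` and `above_mean` are hypothetical, the regex is a simplification that ignores comments and scripts, and the mean threshold is only a stand-in for the clustering step, not the paper's technique.

```python
import re

# Simplified tag matcher (assumption): anything between '<' and '>'.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratios(html):
    """Return one ratio per line: non-tag characters / tag count."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        # Strip tags, then surrounding whitespace, to count text characters.
        text_len = len(TAG_RE.sub("", line).strip())
        # Assumed convention: a line with no tags is divided by 1.
        ratios.append(text_len / max(tag_count, 1))
    return ratios


def above_mean(ratios):
    """Naive stand-in for clustering: indices of lines above the mean ratio."""
    if not ratios:
        return []
    mean = sum(ratios) / len(ratios)
    return [i for i, r in enumerate(ratios) if r > mean]


if __name__ == "__main__":
    page = (
        '<div><a href="#">Home</a><a href="#">About</a></div>\n'
        "<p>This is a long paragraph of actual article text.</p>\n"
        "<br>"
    )
    ratios = tag_ratios(page)
    for i, r in enumerate(ratios):
        print(i, r)
    print("candidate content lines:", above_mean(ratios))
```

Navigation-heavy lines carry many tags and little text, so their ratios stay low, while article paragraphs score high; that contrast is what a clustering step would then try to separate.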
Tim Weninger, William H. Hsu, Jiawei Han
Added 14 May 2010
Updated 14 May 2010
Type Conference
Year 2010
Where WWW