
WIDM
2004
ACM

Stylistic and lexical co-training for web block classification

Many applications that use web data extract information from only a limited number of regions on a web page. As such, dividing a web page into blocks and then classifying those blocks have become a preprocessing step. We introduce PARCELS, an open-source, co-trained approach that performs classification based on separate stylistic and lexical views of the web page. Unlike previous work, PARCELS performs classification on fine-grained blocks. In addition to table-based layout, the system handles real-world pages whose layout is based on divisions and spans (div and span elements), and it performs stylistic inference for pages that use Cascading Style Sheets (CSS). Our evaluation shows that the co-training process yields a 28.5% reduction in error rate compared with a single-view classifier and that our approach is comparable to other state-of-the-art systems.

Categories and Subject Descriptors: I.7.m [Document and Text Processing]: Miscellaneous; H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia. Genera...
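The co-training idea named in the abstract can be illustrated with a minimal sketch. This is not the PARCELS implementation: the Naive Bayes classifier, the feature names, the `co_train` function, and the `per_round` and `threshold` parameters are all assumptions made for illustration. Each block is represented by two feature lists (a stylistic view and a lexical view), one classifier is trained per view, and each classifier's most confident predictions on unlabeled blocks are added to the shared training set.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial Naive Bayes over lists of string features (illustrative stand-in)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for feats, label in zip(docs, labels):
            self.counts[label].update(feats)
        self.vocab_size = len({f for c in self.classes for f in self.counts[c]})
        return self

    def predict_proba(self, feats):
        n = sum(self.prior.values())
        log_p = {}
        for c in self.classes:
            total = sum(self.counts[c].values())
            lp = math.log(self.prior[c] / n)
            for f in feats:
                # Add-one smoothing, with one extra slot for unseen features.
                lp += math.log((self.counts[c][f] + 1) / (total + self.vocab_size + 1))
            log_p[c] = lp
        top = max(log_p.values())
        z = sum(math.exp(lp - top) for lp in log_p.values())
        return {c: math.exp(lp - top) / z for c, lp in log_p.items()}

    def predict(self, feats):
        probs = self.predict_proba(feats)
        return max(probs, key=probs.get)


def fit_views(labeled):
    """Train one classifier per view: index 0 = stylistic, index 1 = lexical."""
    labels = [y for _, y in labeled]
    return tuple(NaiveBayes().fit([x[v] for x, _ in labeled], labels) for v in (0, 1))


def co_train(labeled, unlabeled, rounds=5, per_round=1, threshold=0.6):
    """labeled: [((style_feats, lex_feats), label)]; unlabeled: [(style_feats, lex_feats)]."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        newly = {}
        for view, clf in enumerate(fit_views(labeled)):
            scored = []
            for i, x in enumerate(pool):
                probs = clf.predict_proba(x[view])
                label = max(probs, key=probs.get)
                scored.append((probs[label], i, label))
            scored.sort(reverse=True)
            # Each view contributes only its most confident predictions.
            for conf, i, label in scored[:per_round]:
                if conf >= threshold:
                    newly.setdefault(i, label)
        if not newly:
            break
        labeled += [(pool[i], y) for i, y in newly.items()]
        pool = [x for i, x in enumerate(pool) if i not in newly]
    return fit_views(labeled)  # final (style_clf, lex_clf) on the enlarged set


# Hypothetical toy blocks: (stylistic features, lexical features).
seed = [
    ((["bold", "large"], ["home", "about"]), "nav"),
    ((["plain", "small"], ["the", "story", "said"]), "content"),
]
pool = [
    (["bold", "large"], ["contact", "links"]),
    (["plain"], ["the", "report", "said"]),
]
style_clf, lex_clf = co_train(seed, pool)
print(lex_clf.predict(["contact"]))
```

On this toy data, the stylistic classifier is confident about the first unlabeled block while the lexical classifier is confident about the second, so each view supplies a label the other can then train on. A real system would need far richer features and a held-out set to choose the confidence threshold.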
Chee How Lee, Min-Yen Kan, Sandra Lai
Added 30 Jun 2010
Updated 30 Jun 2010
Type Conference
Year 2004
Where WIDM
Authors Chee How Lee, Min-Yen Kan, Sandra Lai