Cleaning Web Pages for Effective Web Content Mining

10 years 8 months ago
Cleaning Web Pages for Effective Web Content Mining
Classifying and mining noise-free web pages will improve on accuracy of search results as well as search speed, and may benefit webpage organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications). Noise on web pages are irrelevant to the main content on the web pages being mined, and include advertisements, navigation bar, and copyright notices. The few existing work on web page cleaning detect noise blocks with exact matching contents but are weak at detecting near duplicate blocks, characterized by items like navigation bars. This paper proposes a system, WebPageCleaner, for eliminating noise blocks from web pages for purposes of improving the accuracy and efficiency of web content mining. A vision-based technique is employed for extracting blocks from web pages. Then, relevant web page blocks are identified as those with high importance level by analyzing such physical features of the blocks as the block location, percentage of w...
Jing Li, Christie I. Ezeife
Added 13 Oct 2010
Updated 13 Oct 2010
Type Conference
Year 2006
Where DEXA
Authors Jing Li, Christie I. Ezeife
Comments (0)