Web Page Cleaning for Web Mining through Feature Weighting

11 years 1 months ago
Web Page Cleaning for Web Mining through Feature Weighting
Unlike conventional data or text, Web pages typically contain a large amount of information that is not part of the main contents of the pages, e.g., banner ads, navigation bars, and copyright notices. Such irrelevant information (which we call Web page noise) in Web pages can seriously harm Web mining, e.g., clustering and classification. In this paper, we propose a novel feature weighting technique to deal with Web page noise to enhance Web mining. This method first builds a compressed structure tree to capture the common structure and comparable blocks in a set of Web pages. It then uses an information based measure to evaluate the importance of each node in the compressed structure tree. Based on the tree and its node importance values, our method assigns a weight to each word feature in its content block. The resulting weights are used in Web mining. We evaluated the proposed technique with two Web mining tasks, Web page clustering and Web page classification. Experimental result...
Lan Yi, Bing Liu
Added 31 Oct 2010
Updated 31 Oct 2010
Type Conference
Year 2003
Authors Lan Yi, Bing Liu
Comments (0)