Extracting Web Data Using Instance-Based Learning

13 years 8 months ago
Extracting Web Data Using Instance-Based Learning
This paper studies structured data extraction from Web pages, e.g., online product description pages. Existing approaches to data extraction include wrapper induction and automatic methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing each new instance (or page) to be extracted with labeled instances (or pages). The key advantage of our method is that it does not need an initial set of labeled pages to learn extraction rules as in wrapper induction. Instead, the algorithm is able to start extraction from a single labeled instance (or page). Only when a new page cannot be extracted does the page need labeling. This avoids unnecessary page labeling, which solves a major problem with inductive learning (or wrapper induction), i.e., the set of labeled pages may not be representative of all other pages. The instance-based approach is very natural because structured data on the Web usually follow some fixed templates and pages of the sa...
Yanhong Zhai, Bing Liu
Added 25 Jun 2010
Updated 25 Jun 2010
Type Conference
Year 2005
Where WISE
Authors Yanhong Zhai, Bing Liu
Comments (0)