Web data extraction based on partial tree alignment

11 years 6 months ago
Web data extraction based on partial tree alignment
This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract data items/fields from them and put the data in a database table. This problem has been studied by several researchers. However, existing methods still have some serious limitations. The first class of methods is based on machine learning, which requires human labeling of many examples from each Web site that one is interested in extracting data from. The process is time consuming due to the large number of sites and pages on the Web. The second class of algorithms is based on automatic pattern discovery. These methods are either inaccurate or make many assumptions. This paper proposes a new method to perform the task automatically. It consists of two steps, (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. For step 1, we propose a method bas...
Yanhong Zhai, Bing Liu
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2005
Where WWW
Authors Yanhong Zhai, Bing Liu
Comments (0)