AAAI
2006

Automatic Wrapper Generation Using Tree Matching and Partial Tree Alignment

This paper is concerned with the problem of structured data extraction from Web pages. The objective of the research is to automatically segment data records in a page, extract data items/fields from these records and store the extracted data in a database. In this paper, we first introduce the extraction problem, and then discuss the main existing approaches and their limitations. After that, we introduce a novel technique (called DEPTA) to automatically perform Web data extraction. The method consists of three steps: (1) identifying data records with similar patterns in a page, (2) aligning and extracting data items from the identified data records and (3) generating tree-based regular expressions to facilitate later extraction from other similar pages. The key innovation is the proposal of a new multiple tree alignment algorithm called partial tree alignment, which was found to be particularly suitable for Web data extraction. This paper is based on our work published in KDD-03 and...
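The tree matching named in the title can be illustrated with a small sketch. The abstract does not give the algorithm's details, so the code below is an assumption: it implements a standard dynamic-programming "simple tree matching" measure over ordered, labeled trees, which counts the largest number of node pairs that can be matched while preserving parent-child and sibling order. The `Node` class, the function names, and the normalization used in `similarity` are hypothetical choices made for illustration, not the paper's definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a tag node in a parsed HTML tree."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the size of a maximum order-preserving matching between a and b.

    Roots with different tags contribute nothing. Otherwise the children are
    aligned with a longest-common-subsequence style table in which the score
    of pairing two children is their own recursive matching size.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i children of a
    # and the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 for the matched roots


def tree_size(node: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(tree_size(child) for child in node.children)


def similarity(a: Node, b: Node) -> float:
    """Matching size divided by the average tree size (assumed normalization)."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


if __name__ == "__main__":
    # Two made-up table rows that share most of their structure.
    rec1 = Node("tr", [Node("td", [Node("img")]), Node("td", [Node("a")]), Node("td")])
    rec2 = Node("tr", [Node("td", [Node("img")]), Node("td")])
    print(simple_tree_matching(rec1, rec2), similarity(rec1, rec2))
```

Under these assumptions, subtrees whose similarity exceeds some chosen threshold could be treated as candidate data records with similar patterns, which is the role step (1) of the abstract describes. The threshold and how candidates are enumerated are left open here.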
Yanhong Zhai, Bing Liu
Added: 30 Oct 2010
Updated: 30 Oct 2010
Type: Conference
Year: 2006
Where: AAAI
Authors: Yanhong Zhai, Bing Liu