IEPAD: information extraction based on pattern discovery

Research in information extraction (IE) concerns the generation of wrappers that can extract particular information from semistructured Web documents. Similar to compiler generation, the extractor is actually a driver program, which is accompanied by the generated extraction rule. Previous work in this field aims to learn extraction rules from users' training examples. In this paper, we propose IEPAD, a system that automatically discovers extraction rules from Web pages. The system can automatically identify record boundaries by repeated pattern mining and multiple sequence alignment. The discovery of repeated patterns is realized through a data structure called PAT trees. Additionally, repeated patterns are further extended by pattern alignment to comprehend all record instances. This new track to IE involves neither human effort nor content-dependent heuristics. Experimental results show that the constructed extraction rules can achieve 97 percent extraction over fourteen popular s...
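To make the repeated-pattern idea in the abstract concrete, here is a minimal sketch in Python. It is not the IEPAD implementation: it swaps the PAT tree for a naive enumeration of token substrings, omits the alignment step entirely, and the function name `repeated_patterns`, its parameters, and the sample token sequence are all assumptions made for illustration.

```python
from collections import defaultdict


def repeated_patterns(tokens, min_len=2, min_count=2):
    """Map each repeated contiguous token run to its start positions.

    Hypothetical helper: a brute-force stand-in for PAT-tree mining.
    Occurrences may overlap; positions are listed in ascending order.
    """
    found = defaultdict(list)
    n = len(tokens)
    # A run longer than n // min_count cannot occur min_count times
    # without overlapping, so longer runs are not enumerated.
    for length in range(min_len, n // min_count + 1):
        for start in range(n - length + 1):
            found[tuple(tokens[start:start + length])].append(start)
    return {p: pos for p, pos in found.items() if len(pos) >= min_count}


# Assumed input: a page already abstracted into a tag-token sequence.
tokens = ["<ul>",
          "<li>", "<a>", "TEXT", "</a>", "</li>",
          "<li>", "<a>", "TEXT", "</a>", "</li>",
          "<li>", "<a>", "TEXT", "</a>", "</li>",
          "</ul>"]

patterns = repeated_patterns(tokens)

# Longest run that occurs at least three times: a candidate record pattern.
record = max((p for p, pos in patterns.items() if len(pos) >= 3), key=len)
```

On this toy sequence the selected run is the five-token list item, and its start positions mark the record boundaries. A suffix-based index such as a PAT tree finds the same repeats without enumerating every substring, which matters on real pages.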
Chia-Hui Chang, Shao-Chen Lui
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2001
Where WWW