Sciweavers

FLAIRS
2001

Extracting Partial Structures from HTML Documents

A new wrapper model for extracting text data from HTML documents is introduced. Kushmerick's wrapper class (Kushmerick 2000) may fail when sufficiently long delimiters cannot be found. The wrapper class introduced in this paper partially overcomes this difficulty by using the tree structure of HTML documents. The problem of learning such a wrapper program from given text is considered. Moreover, we try to extend our wrapper to extract portions of HTML, not only text attributes.
Added 31 Oct 2010
Updated 31 Oct 2010
Type Conference
Year 2001
Where FLAIRS
Authors Hiroshi Sakamoto, Yoshitsugu Murakami, Hiroki Arimura, Setsuo Arikawa