Extracting Partial Structures from HTML Documents



A new wrapper model for extracting text data from HTML documents is introduced. Kushmerick's wrapper class (Kushmerick 2000) may fail when sufficiently long delimiters cannot be found. The wrapper class introduced in this paper partially overcomes this difficulty by using the tree structure of HTML documents. The problem of learning such a wrapper program from given text is considered. Moreover, we try to extend our wrapper to extract portions of the HTML itself, not only text attributes.
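To make the delimiter problem concrete, here is a minimal Python sketch. It is an illustration only and is not the wrapper class defined in the paper. The functions `extract_between` and `extract_by_path`, the class `PathExtractor`, and the sample `PAGE` are all hypothetical names invented for this example. The first function extracts text between a fixed left and right delimiter, in the spirit of delimiter-based wrappers. The second selects text by the path of enclosing tags, which is one simple way of using the tree structure.

```python
from html.parser import HTMLParser


def extract_between(page, left, right):
    """Delimiter-based extraction: collect every string between `left` and `right`."""
    found, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        found.append(page[start:end])
        pos = end + len(right)
    return found


class PathExtractor(HTMLParser):
    """Tree-based extraction: collect text whose enclosing tag path equals `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = tuple(target)
        self.stack = []   # currently open tags, outermost first
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding any unclosed tags above it.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if tuple(self.stack) == self.target and data.strip():
            self.found.append(data.strip())


def extract_by_path(page, target):
    parser = PathExtractor(target)
    parser.feed(page)
    parser.close()
    return parser.found


# Hypothetical sample page: names and ages are both wrapped in <b> tags.
PAGE = ("<ul><li><b>Alice</b> <i><b>30</b></i></li>"
        "<li><b>Bob</b> <i><b>25</b></i></li></ul>")

names_and_ages = extract_between(PAGE, "<b>", "</b>")
names = extract_by_path(PAGE, ["ul", "li", "b"])
ages = extract_by_path(PAGE, ["ul", "li", "i", "b"])
```

On this sample, the short delimiters `<b>` and `</b>` cannot separate the two attributes, so `names_and_ages` mixes names with ages. The two tag paths differ, so `names` and `ages` come out separately.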

Hiroshi Sakamoto, Yoshitsugu Murakami, Hiroki Arimura, Setsuo Arikawa


Artificial Intelligence | FLAIRS 2001 | HTML Documents | Kushmerick's Wrapper Class | Wrapper Class |


Post Info

Added	31 Oct 2010
Updated	31 Oct 2010
Type	Conference
Year	2001
Where	FLAIRS
Authors	Hiroshi Sakamoto, Yoshitsugu Murakami, Hiroki Arimura, Setsuo Arikawa
