Sciweavers

IJCAI
2003

Expressive Power of Tree and String Based Wrappers

13 years 6 months ago
Expressive Power of Tree and String Based Wrappers
There exist two types of wrappers: the string based wrapper such as the LR wrapper, and the tree based wrapper. A tree based wrapper designates extraction regions by nodes on the trees of semistructured documents. The tree based wrapper seems to be more powerful than the string based one. There exist, however, many HTML documents on the Web such that a standard tree based wrapper fails to extract contents because they are structured by presentational tags, punctuation symbols, and white spaces. Moreover, some of such documents use multi-byte characters for structuring. To treat some of such documents, we propose automatic wrapper generation based on common substring detection and to use input documents without any modification. In this framework, a part of text elements including white spaces and multibyte characters can be a part of a wrapper. We show the superiority such wrappers to usual wrappers created after document are parsed and modified. However, there still exist HTML docu...
Daisuke Ikeda, Yasuhiro Yamada, Sachio Hirokawa
Added 31 Oct 2010
Updated 31 Oct 2010
Type Conference
Year 2003
Where IJCAI
Authors Daisuke Ikeda, Yasuhiro Yamada, Sachio Hirokawa
Comments (0)