Sciweavers

24 search results - page 2 / 5
» DOM-based content extraction of HTML documents
Sort
View
DEXAW
2008
IEEE
123views Database» more  DEXAW 2008»
13 years 11 months ago
Text Extraction from the Web via Text-to-Tag Ratio
– We describe a method to extract content text from diverse Web pages by using the HTML document’s Text-to-Tag Ratio rather than specific HTML cues that may not be constant acr...
Tim Weninger, William H. Hsu
DOCENG
2009
ACM
13 years 11 months ago
Web document text and images extraction using DOM analysis and natural language processing
: © Web Document Text and Images Extraction using DOM Analysis and Natural Language Processing Parag Mulendra Joshi, Sam Liu HP Laboratories HPL-2009-187 Web page text extraction,...
Parag Mulendra Joshi, Sam Liu
APWEB
2003
Springer
13 years 10 months ago
Extracting Content Structure for Web Pages Based on Visual Representation
Abstract. A new web content structure based on visual representation is proposed in this paper. Many web applications such as information retrieval, information extraction and auto...
Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma
WWW
2006
ACM
14 years 5 months ago
Robust web content extraction
We present an empirical evaluation and comparison of two content extraction methods in HTML: absolute XPath expressions and relative XPath expressions. We argue that the relative ...
Marek Kowalkiewicz, Maria E. Orlowska, Tomasz Kacz...
IJCAI
2003
13 years 6 months ago
Expressive Power of Tree and String Based Wrappers
There exist two types of wrappers: the string based wrapper such as the LR wrapper, and the tree based wrapper. A tree based wrapper designates extraction regions by nodes on the ...
Daisuke Ikeda, Yasuhiro Yamada, Sachio Hirokawa