
WWW
2006
ACM

Robust web content extraction

We present an empirical evaluation and comparison of two content extraction methods in HTML: absolute XPath expressions and relative XPath expressions. We argue that the relative XPath expressions, although not widely used, should be used in preference to absolute XPath expressions in extracting content from human-created Web documents. Evaluation of robustness covers four thousand queries executed on several hundred webpages. We show that in referencing parts of real world dynamic HTML documents, relative XPath expressions are on average significantly more robust than absolute XPath ones.

Categories and Subject Descriptors H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia - architectures, navigation, theory.
General Terms Measurement, Performance, Reliability, Experimentation
Keywords Content Extraction, Robustness, Wrappers, Evaluation.
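
The contrast between the two kinds of expression can be illustrated with a small sketch. This is not the paper's experimental setup: the two page snapshots, the expressions, and the `extract` helper are hypothetical, and "relative" is read here as "anchored on a nearby attribute rather than on the full path from the document root". Python's `xml.etree.ElementTree` supports only a subset of XPath and evaluates paths from the root element, so the absolute expression `/html/body/div[2]/p[1]` is written as `./body/div[2]/p[1]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical snapshots of one page: V2 gains an unrelated banner at the top.
PAGE_V1 = """<html><body>
<div id="nav">Menu</div>
<div id="content"><p>Headline text</p></div>
</body></html>"""

PAGE_V2 = """<html><body>
<div id="banner">Ad</div>
<div id="nav">Menu</div>
<div id="content"><p>Headline text</p></div>
</body></html>"""

# Stand-in for the absolute expression /html/body/div[2]/p[1].
# The leading /html is dropped because the path is evaluated from <html>.
ABSOLUTE = "./body/div[2]/p[1]"

# Relative-style expression (an assumption for illustration): locate the
# target through the id attribute of its container, wherever that sits.
RELATIVE = ".//div[@id='content']/p[1]"


def extract(page, path):
    """Return the text of the first node matching path, or None if no match."""
    root = ET.fromstring(page)
    node = root.find(path)
    return None if node is None else node.text


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, "absolute:", extract(page, ABSOLUTE))
        print(name, "relative:", extract(page, RELATIVE))
```

In this toy case the positional path stops matching once the banner shifts the `div` indices, while the attribute-anchored path still finds the paragraph. Real pages are not well-formed XML in general, so an HTML-aware parser would be needed outside of a sketch like this one.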
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2006
Where WWW
Authors Marek Kowalkiewicz, Maria E. Orlowska, Tomasz Kaczmarek, Witold Abramowicz