Robust web content extraction


We present an empirical evaluation and comparison of two content extraction methods in HTML: absolute XPath expressions and relative XPath expressions. We argue that the relative XPath expressions, although not widely used, should be used in preference to absolute XPath expressions in extracting content from human-created Web documents. Evaluation of robustness covers four thousand queries executed on several hundred webpages. We show that in referencing parts of real world dynamic HTML documents, relative XPath expressions are on average significantly more robust than absolute XPath ones.

Categories and Subject Descriptors: H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia - architectures, navigation, theory.

General Terms: Measurement, Performance, Reliability, Experimentation.

Keywords: Content Extraction, Robustness, Wrappers, Evaluation.
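The contrast between absolute and relative XPath expressions can be illustrated with a minimal sketch. It is not the authors' evaluation setup: the two page versions, the `extract` helper, and the `id="price"` anchor are hypothetical, and "relative" is taken here to mean an expression anchored on an attribute rather than on the full path from the document root, which may be narrower than the paper's definition. Python's `xml.etree.ElementTree` supports only a subset of XPath and rejects a leading `/`, so the absolute path is written from the root element.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page; v2 adds a banner <div> on top.
PAGE_V1 = """<html><body>
<div><h1>Shop</h1></div>
<div><span id="price">42 USD</span></div>
</body></html>"""

PAGE_V2 = """<html><body>
<div>Sale banner</div>
<div><h1>Shop</h1></div>
<div><span id="price">42 USD</span></div>
</body></html>"""

# Stands in for /html/body/div[2]/span: every step is fixed by position.
ABSOLUTE = "./body/div[2]/span"
# Anchored on an attribute, independent of where the node sits in the tree.
RELATIVE = ".//span[@id='price']"


def extract(page, path):
    """Return the text of the first node matching path, or None if no match."""
    node = ET.fromstring(page).find(path)
    return node.text if node is not None else None


for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name, "absolute:", extract(page, ABSOLUTE),
          "relative:", extract(page, RELATIVE))
```

In this toy case the positional path stops matching once the banner shifts the target to the third `<div>`, while the attribute-anchored expression still finds it. This is the kind of breakage a robustness evaluation would count, though a single example says nothing about the average behaviour the abstract reports.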

Marek Kowalkiewicz, Maria E. Orlowska, Tomasz Kaczmarek, Witold Abramowicz


Absolute XPath Expressions | Absolute XPath Ones | Internet Technology | Relative XPath Expressions | WWW 2006


» Hybrid semantic tagging for information extraction

» Expected Utility of Content Blocks in Web Content Extraction

» Partial Information Extraction Approach to Lightweight Integration on the Web

» Extracting Content Structure for Web Pages Based on Visual Representation

» A SOM-Based Technique for a User-Centric Content Extraction and Classification of Web 2.0 wit...

» Extracting context to improve accuracy for HTML content extraction

» Automatic extraction of clickable structured web contents for name entity queries

» Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data

Post Info

Added	22 Nov 2009
Updated	22 Nov 2009
Type	Conference
Year	2006
Where	WWW
Authors	Marek Kowalkiewicz, Maria E. Orlowska, Tomasz Kaczmarek, Witold Abramowicz


Sciweavers
