Sciweavers

WWW
2005
ACM

Using visual cues for extraction of tabular data from arbitrary HTML documents

We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model that contains the positional data for all HTML elements of a given web page.
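The abstract does not spell out which variant of the X-Y cut the authors use, so the following is only a minimal sketch of the classic recursive X-Y cut, applied to element bounding boxes. It assumes the boxes have already been read from the rendered page as `(left, top, right, bottom)` tuples. The names `Box`, `split_by_gaps`, `xy_cut` and the `min_gap` threshold are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Assumed box layout: (left, top, right, bottom) in rendered-page pixels.
Box = Tuple[float, float, float, float]


def split_by_gaps(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Group boxes that are separated by empty space of at least min_gap.

    axis 0 projects onto x (yields vertical cuts),
    axis 1 projects onto y (yields horizontal cuts).
    """
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # farthest extent covered so far on this axis
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:
            groups.append([box])  # wide enough gap: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[List[Box]]:
    """Recursively cut the layout and return the leaf blocks."""
    if len(boxes) <= 1:
        return [boxes] if boxes else []
    for axis in (1, 0):  # try horizontal cuts first, then vertical cuts
        groups = split_by_gaps(boxes, axis, min_gap)
        if len(groups) > 1:
            leaves: List[List[Box]] = []
            for group in groups:
                leaves.extend(xy_cut(group, min_gap))
            return leaves
    return [boxes]  # no cut possible on either axis


# Four boxes arranged as a 2x2 grid, with no table markup involved.
cells = [(0, 0, 40, 10), (50, 0, 90, 10), (0, 20, 40, 30), (50, 20, 90, 30)]
leaves = xy_cut(cells)
```

In this sketch the grid is first cut into two rows and each row is then cut into two cells, so `leaves` holds four single-box blocks in row-major order. Boxes that overlap or sit closer together than `min_gap` stay in one block.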
Bernhard Krüpl, Marcus Herzog, Wolfgang Gatterbauer
Added: 22 Nov 2009
Updated: 22 Nov 2009
Type: Conference
Year: 2005
Where: WWW
Authors: Bernhard Krüpl, Marcus Herzog, Wolfgang Gatterbauer