Using visual cues for extraction of tabular data from arbitrary HTML documents


Download www.dbai.tuwien.ac.at

We describe a method to extract tabular data from web pages. Rather than analyzing only the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables that are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model, which contains the positional data for all HTML elements of a given web page.
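The abstract does not spell out which variant of X-Y cut the authors use, so the following is only a minimal sketch of the textbook recursive X-Y cut, not the paper's implementation. It assumes each rendered element has already been reduced to a bounding box `(left, top, right, bottom)`; the names `Box`, `split_on_gaps`, `xy_cut`, and the `min_gap` threshold are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical box type: (left, top, right, bottom) in rendered pixels.
Box = Tuple[float, float, float, float]


def split_on_gaps(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by empty gaps of at least min_gap along one axis.

    axis=0 cuts with vertical lines (groups are columns),
    axis=1 cuts with horizontal lines (groups are rows).
    """
    lo, hi = axis, axis + 2  # indices of the start and end coordinate
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # furthest extent covered so far
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:
            groups.append([box])  # wide enough whitespace: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 5.0, axis: int = 1) -> List[List[Box]]:
    """Recursively partition boxes, alternating the cut direction.

    Returns the leaf groups: sets of boxes that no gap of at least
    min_gap separates in either direction.
    """
    if len(boxes) <= 1:
        return [boxes] if boxes else []
    # Try the preferred axis first, then the other one.
    for candidate in (axis, 1 - axis):
        groups = split_on_gaps(boxes, candidate, min_gap)
        if len(groups) > 1:
            leaves: List[List[Box]] = []
            for group in groups:
                leaves.extend(xy_cut(group, min_gap, 1 - candidate))
            return leaves
    return [boxes]  # no cut possible: this group is a leaf


# Four boxes laid out as a 2x2 grid, with no table markup involved.
grid = [(0, 0, 40, 10), (60, 0, 100, 10), (0, 20, 40, 30), (60, 20, 100, 30)]
cells = xy_cut(grid)
```

For the 2x2 layout above, the first cut separates the two rows and the second separates the columns within each row, so `cells` holds four single-box groups in row-major order. A table detector built on this idea would additionally have to check that the cuts line up into a regular grid, which the sketch leaves out.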

Bernhard Krüpl, Marcus Herzog, Wolfgang Gatterbauer


HTML Table Element | Internet Technology | Positional Data | Tabular Data | WWW 2005


» Web Ecology Recycling HTML Pages as XML Documents Using W4F

» Discovering informative content blocks from Web documents

» Object-level document analysis of PDF files

» Recognition of Common Areas in a Web Page Using Visual Information a possible application ...

Post Info

Added	22 Nov 2009
Updated	22 Nov 2009
Type	Conference
Year	2005
Where	WWW
Authors	Bernhard Krüpl, Marcus Herzog, Wolfgang Gatterbauer

