An increasing amount of Web data is accessible only by filling out HTML forms to query an underlying data source. While this is most welcome from a user perspective (queries are e...
Thepaper deals with investigations concerning potential structures of documentsthat will be subject to automated information extraction. The focus is on folding principles and the...
Over the past decade the Internet has evolved into the largest public community in the world. It provides a wealth of data content and services in almost every field of science, t...
We propose a novel approach that identifies web page templates and extracts the unstructured data. Extracting only the body of the page and eliminating the template increases the ...
We present two machine learning approaches to information extraction from semi-structured documents that can be used if no annotated training data are available, but there does ex...