Effective Web data extraction with standard XML technologies

We discuss the problem of Web data extraction and describe an XML-based methodology whose goal extends far beyond simple "screen scraping." An ideal data extraction process is able to digest target Web databases that are visible only as HTML pages and create a local, identical replica of those databases as a result. What is needed in this process is much more than a Web crawler and a set of Web site wrappers. A comprehensive data extraction process needs to deal with roadblocks such as session identifiers, HTML forms, and client-side JavaScript, as well as data integration problems such as incompatible datasets and vocabularies and missing or conflicting data. Proper data extraction also requires a solid data validation and error recovery service to handle data extraction failures, which are unavoidable. In this paper we describe ANDES, a software framework that makes significant advances in solving these problems and provides a platform for building a production-quality Web ...
Jussi Myllymaki
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2001
Where WWW