WIDM
2003
ACM

Datarover: a taxonomy based crawler for automated data extraction from data-intensive websites

The advent of e-commerce has brought thousands of catalogs online. Most of these websites are “taxonomy-directed”. A website is said to be “taxonomy-directed” if it contains at least one taxonomy for organizing its contents and it presents the instances belonging to a category in a regular fashion. This paper describes the DataRover system, which can automatically crawl and extract products from taxonomy-directed online catalogs. DataRover uses heuristic rules to discover the structural regularities among taxonomy segments, list-of-product pages, and single-product pages, and it uses these regularities to turn online catalogs into a database of categorized products without the need for user interaction or the burden of wrapper maintenance. We provide experimental results to demonstrate the efficacy of DataRover and point to its current limitations. Categories and Subject Descriptors: H.m [Information Systems]: Miscellaneous. General Terms: Algorith...
Hasan Davulcu, S. Koduri, Saravanakumar Nagarajan
Added 05 Jul 2010
Updated 05 Jul 2010
Type Conference
Year 2003
Where WIDM