Not so creepy crawler: easy crawler generation with standard XML queries


Download www2.pms.ifi.lmu.de

Web crawlers are increasingly used for focused tasks such as the extraction of data from Wikipedia or the analysis of social networks like last.fm. In these cases, pages are far more uniformly structured than in the general Web, and thus crawlers can use the structure of Web pages for more precise data extraction and more expressive analysis. In this demonstration, we present a focused, structure-based crawler generator, the "Not so Creepy Crawler" (nc2). What sets nc2 apart is that all analysis and decision tasks of the crawling process are delegated to an (arbitrary) XML query engine such as XQuery or Xcerpt. Customizing crawlers just means writing (declarative) XML queries that can access the currently crawled document as well as the metadata of the crawl process. We identify four types of queries that together suffice to realize a wide variety of focused crawlers. We demonstrate nc2 with two applications: The first extracts data about cities from Wikipedia with a customizable s...
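
To make the idea of delegating crawl decisions to declarative queries more concrete, here is a minimal sketch in Python. It is not nc2's implementation: it uses the limited XPath subset of `xml.etree.ElementTree` as a stand-in for an XQuery or Xcerpt engine, crawls a hypothetical in-memory set of pages instead of the Web, and assumes just two illustrative query roles (`follow_query` and `data_query`). The abstract does not name its four query types, so these roles, the `PAGES` data, and the `crawl` function are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical in-memory "web": URL -> well-formed XML page (no network access).
PAGES = {
    "a": "<page><title>A</title><a href='b' class='city'/><a href='c' class='other'/></page>",
    "b": "<page><title>B</title><a href='a' class='city'/></page>",
    "c": "<page><title>C</title></page>",
}


def crawl(start, follow_query, data_query, max_depth=2):
    """Breadth-first crawl in which two user-supplied queries decide
    which links to follow and which data to extract (illustrative only)."""
    seen = {start}
    frontier = deque([(start, 0)])  # (url, depth); depth is crawl metadata
    results = []
    while frontier:
        url, depth = frontier.popleft()
        doc = ET.fromstring(PAGES[url])
        # Data query: select the nodes whose text should be extracted.
        for node in doc.findall(data_query):
            results.append((url, node.text))
        if depth >= max_depth:
            continue
        # Link query: select only the links the crawler should follow.
        for link in doc.findall(follow_query):
            target = link.get("href")
            if target in PAGES and target not in seen:
                seen.add(target)
                frontier.append((target, depth + 1))
    return results


# Customizing the crawler means changing only the two query strings.
records = crawl("a", follow_query=".//a[@class='city']", data_query="./title")
```

In this sketch the crawl loop stays generic, and the focus of the crawler is determined entirely by the query strings. Unlike the approach described in the abstract, the depth limit here is enforced by the loop rather than by a query over the crawl metadata.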



Data Extraction | Internet Technology | Precise Data Extraction | Web Crawler | WWW 2010 |


Post Info

Added	14 May 2010
Updated	14 May 2010
Type	Conference
Year	2010
Where	WWW
Authors	Franziska von dem Bussche, Klara A. Weiand, Benedikt Linse, Tim Furche, François Bry
