ViPER: augmenting automatic information extraction with visual perceptions

10 years 5 months ago
ViPER: augmenting automatic information extraction with visual perceptions
In this paper we address the problem of unsupervised Web data extraction. We show that unsupervised Web data extraction becomes feasible when supposing pages that are made up of repetitive patterns, as it is the case, e.g., for search engine result pages. Hereby the extraction rules are generated automatically without any training or human interaction, by means of operating on the DOM tree respectively the flat tag token sequence of a single page. Our contribution to automatic data extraction through this paper is twofold. First, we identify and rank potential repetitive patterns with respect to the user’s visual perception of the Web page, well aware that location and size of matching elements within a Web page constitute important criteria for defining relevance. Second, matching subsequences of the pattern with the highest weightiness are aligned with global multiple sequence alignment techniques. Experimental results show that our system is able to achieve high accuracy in dis...
Kai Simon, Georg Lausen
Added 26 Jun 2010
Updated 26 Jun 2010
Type Conference
Year 2005
Where CIKM
Authors Kai Simon, Georg Lausen
Comments (0)