Learning to extract form labels

In this paper we describe a new approach to extracting element labels from Web form interfaces. These labels are required by several techniques that retrieve and integrate information hidden behind form interfaces, such as hidden-Web crawlers and metasearchers. However, given the wide variation in form layout, even within a well-defined domain, automatically extracting these labels is a challenging problem. Whereas previous approaches to this problem have relied on heuristics and manually specified extraction rules, our technique uses a learning classifier ensemble to identify element-label mappings, and it applies a reconciliation step that leverages the classifier-derived mappings to boost extraction accuracy. We present a detailed experimental evaluation using over three thousand Web forms. Our results show that our approach is effective: it obtains significantly higher accuracy and is more robust to variability in form layout than previous labe...
Hoa Nguyen, Thanh Hoang Nguyen, Juliana Freire
Added 28 Dec 2010
Updated 28 Dec 2010
Type Journal
Year 2008