Bootstrapping Information Extraction from Field Books

We present two machine learning approaches to information extraction from semi-structured documents that can be used if no annotated training data are available, but there does exist a database filled with information derived from the type of documents to be processed. One approach employs standard supervised learning for information extraction by artificially constructing labelled training data from the contents of the database. The second approach combines unsupervised Hidden Markov modelling with language models. Empirical evaluation of both systems suggests that it is possible to bootstrap a field segmenter from a database alone. The combination of Hidden Markov and language modelling was found to perform best at this task.
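The first approach, constructing labelled training data from database contents, can be illustrated with a minimal sketch. This is not the authors' implementation. The field names (`species`, `location`, `date`, `collector`), the example records, and the helper `make_training_sequence` are all hypothetical, and whitespace tokenisation with shuffled field order is an assumption made only to keep the example short.

```python
import random


def make_training_sequence(record, rng):
    """Turn one database record into a list of (token, field_label) pairs.

    Assumption: the order of fields in a real document is unknown, so the
    non-empty fields are shuffled before being concatenated.
    """
    fields = [(name, value) for name, value in record.items() if value]
    rng.shuffle(fields)
    labelled = []
    for name, value in fields:
        # Every token inherits the label of the database field it came from.
        for token in value.split():
            labelled.append((token, name))
    return labelled


# Invented records standing in for rows of an existing database.
records = [
    {"species": "Rana temporaria", "location": "near Leiden", "date": "12 May 1931"},
    {"species": "Bufo bufo", "location": "Utrecht", "date": "", "collector": "unknown"},
]

rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
training_data = [make_training_sequence(r, rng) for r in records]

for sequence in training_data:
    print(sequence)
```

Each resulting sequence is a synthetic "document" whose tokens carry field labels, which is the form of input a standard supervised sequence labeller expects. Empty database fields are skipped so that no label appears without tokens.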
Sander Canisius, Caroline Sporleder
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2007