Sciweavers

SIGMOD
2001
ACM

Automatic Segmentation of Text into Structured Records

In this paper we present a method for automatically segmenting unformatted text records into structured elements. Several useful data sources today are human-generated as continuous text, whereas convenient usage requires the data to be organized as structured records. A prime motivation is the warehouse address cleaning problem: transforming dirty addresses stored in large corporate databases as a single text field into subfields like "City" and "Street". Existing tools rely on hand-tuned, domain-specific, rule-based systems. We describe a tool, datamold, that learns to automatically extract structure when seeded with a small number of training examples. The tool extends Hidden Markov Models (HMMs) to build a powerful probabilistic model that combines multiple sources of information, including the sequence of elements, their length distribution, distinguishing words from the vocabulary, and an optional external data dictionary. Experiments on real-life dataset...
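To make the idea of HMM-based segmentation concrete, here is a minimal sketch of a plain first-order HMM decoded with the Viterbi algorithm on a toy address. It is not the datamold model described in the abstract: the state names, the coarse token classes, the tiny street-word dictionary, and every probability below are hypothetical values set by hand for illustration, whereas the tool in the paper learns its parameters from training examples and also models element lengths.

```python
import math

# Hypothetical element labels for a toy address schema.
STATES = ["HouseNo", "Street", "City", "Zip"]
# Hypothetical stand-in for an external data dictionary.
STREET_WORDS = {"street", "road", "avenue", "lane"}

# All probabilities are illustrative, hand-set assumptions.
START = {"HouseNo": 0.7, "Street": 0.2, "City": 0.05, "Zip": 0.05}
TRANS = {
    "HouseNo": {"HouseNo": 0.05, "Street": 0.85, "City": 0.05, "Zip": 0.05},
    "Street": {"HouseNo": 0.02, "Street": 0.58, "City": 0.35, "Zip": 0.05},
    "City": {"HouseNo": 0.02, "Street": 0.03, "City": 0.45, "Zip": 0.50},
    "Zip": {"HouseNo": 0.05, "Street": 0.05, "City": 0.05, "Zip": 0.85},
}
EMIT = {
    "HouseNo": {"short_num": 0.85, "long_num": 0.05, "street_word": 0.05, "word": 0.05},
    "Street": {"short_num": 0.10, "long_num": 0.02, "street_word": 0.40, "word": 0.48},
    "City": {"short_num": 0.02, "long_num": 0.02, "street_word": 0.02, "word": 0.94},
    "Zip": {"short_num": 0.10, "long_num": 0.85, "street_word": 0.02, "word": 0.03},
}


def token_class(token):
    """Map a raw token to one of four coarse observation symbols."""
    if token.isdigit():
        return "short_num" if len(token) <= 4 else "long_num"
    if token.lower() in STREET_WORDS:
        return "street_word"
    return "word"


def viterbi(tokens):
    """Return the most likely state sequence for a non-empty token list."""
    obs = [token_class(t) for t in tokens]
    # Log-probabilities avoid underflow on longer inputs.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Follow back-pointers from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def segment(text):
    """Split text on whitespace and group tokens by their decoded label."""
    tokens = text.split()
    if not tokens:
        return {}
    record = {}
    for token, label in zip(tokens, viterbi(tokens)):
        record[label] = (record.get(label, "") + " " + token).strip()
    return record


print(segment("115 Grant Street Pittsburgh 15213"))
```

With these hand-set numbers the toy address decodes to the house number "115", the street "Grant Street", the city "Pittsburgh", and the zip "15213". Changing the probabilities or the token classes changes the segmentation, which is why learning them from labeled examples matters in practice.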
Added 08 Dec 2009
Updated 08 Dec 2009
Type Conference
Year 2001
Where SIGMOD
Authors Vinayak R. Borkar, Kaustubh Deshmukh, Sunita Sarawagi