Integrating Data and Probabilistically Structured Text Documents

13 years 10 months ago

Download wwwiti.cs.uni-magdeburg.de

Commercial, non-profit and public organizations are accumulating huge amounts of electronically available text documents. Although they consist of unstructured text, the documents in archives of annual reports to shareholders, medical patient records and public announcements often share an inherent, though undocumented, structure. To integrate text collections with related structured data sources, this inherent structure should be made explicit in as much detail as possible. The goal of this study is to establish a methodology for integrating text documents with structured records into a hyper-archive of application-specific entities. The implicit structure of the text documents is made explicit by data mining techniques, as proposed in the DIAsDEM framework for semantic tagging of domain-specific text documents. The result is a probabilistic DTD that serves as a basis for the matching of schemata and for the matching of data in...

Karsten Winkler, Myra Spiliopoulou
