Using the Web to Reduce Data Sparseness in Pattern-Based Information Extraction

Download www.aifb.uni-karlsruhe.de

Textual patterns have been used effectively to extract information from large text collections. However, they rely heavily on textual redundancy, in the sense that facts have to be mentioned in a similar manner in order to be generalized to a textual pattern. Data sparseness thus becomes a problem when trying to extract information from sources with little redundancy, such as corporate intranets, encyclopedic works, or scientific databases. We present results on applying a weakly supervised pattern induction algorithm to Wikipedia to extract instances of arbitrary relations. In particular, we apply different configurations of a basic algorithm for pattern induction to seven different datasets. We show that the lack of redundancy leads to the need for a large amount of training data, but that integrating Web extraction into the process leads to a significant reduction of required training data while maintaining the accuracy of Wikipedia. In particular, we show that, though the use of the Web can hav...

Sebastian Blohm, Philipp Cimiano
