Information extraction from Wikipedia: moving down the long tail

Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes: (1) shrinkage over an automatically-learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web. Our experiments compare design variations and show that, used in con

Fei Wu, Raphael Hoffmann, Daniel S. Weld

Automatically-learned Subsumption Taxonomy | Data Mining | Insufficient Training Data | KDD 2008 | Self-supervised Information Extraction

Post Info

Added	30 Nov 2009
Updated	30 Nov 2009
Type	Conference
Year	2008
Where	KDD
Authors	Fei Wu, Raphael Hoffmann, Daniel S. Weld
