Web-scale knowledge extraction from semi-structured tables

15 years 2 months ago


A wealth of knowledge is encoded in the form of tables on the World Wide Web. We propose a classification algorithm and a rich feature set for automatically recognizing layout tables and attribute/value tables. We report the frequencies of these table types in a large-scale analysis of the Web and pose open challenges for extracting semantic triples (knowledge) from attribute/value tables. We then describe a solution to a key problem in extracting semantic triples: protagonist detection, i.e., finding the subject of the table, which is often not present in the table itself. In 79% of our Web tables, our method finds the correct protagonist among its top three returned candidates.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning – knowledge acquisition.

General Terms: Algorithms, Experimentation, Measurement.

Keywords: Information extraction, structured data, web tables, classification.
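To make the role of the protagonist concrete, here is a minimal sketch of how an attribute/value table could be turned into semantic triples once its subject is known. It is not the authors' method: the function name `table_to_triples`, the cleanup rules, and the sample table are all assumptions made for illustration, and the protagonist is simply passed in rather than detected.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, attribute, value)


def table_to_triples(protagonist: str, rows: List[Tuple[str, str]]) -> List[Triple]:
    """Pair each attribute/value row with the table's subject (hypothetical helper)."""
    triples = []
    for attribute, value in rows:
        # Assumed cleanup: trim whitespace and a trailing colon on the attribute.
        attribute = attribute.strip().rstrip(":")
        value = value.strip()
        # Skip rows that are missing either side of the pair.
        if attribute and value:
            triples.append((protagonist, attribute, value))
    return triples


# Made-up example table; the subject "Blade Runner" does not appear in any row,
# which is why it has to be supplied (or detected) separately.
rows = [("Director:", "Ridley Scott"), ("Release year", "1982"), ("", "n/a")]
triples = table_to_triples("Blade Runner", rows)
```

Without the protagonist, the rows only yield attribute/value pairs with no subject to attach them to, which is why the abstract treats its detection as a key problem.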

Eric Crestan, Patrick Pantel
