Sciweavers

RIAO
2007

From Layout to Semantic: a Reranking Model for Mapping Web Documents to Mediated XML Representations

Many documents on the Web use a weakly structured format. Because of their weak semantics and the heterogeneity of their formats, the information conveyed by their structure cannot be directly exploited. We consider here the conversion of such documents into a predefined mediated semi-structured format that is more amenable to automatic processing of the document content. We develop a machine learning approach to this conversion problem in which the transformation is learned automatically from a set of example documents manually transformed into the target structure. Our method proceeds in three steps. Given an input document, document elements are first annotated with labels of the target schema. Structured candidate documents are then generated using a generalized probabilistic context-free parsing algorithm. Finally, candidates are reranked using a perceptron-like ranking algorithm. Experiments performed on two different datasets show that the proposed met...
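The third step of the abstract, reranking candidate documents with a perceptron-like algorithm, can be illustrated with a minimal sketch. This is not the authors' implementation: the tree encoding (nested tuples), the parent>child feature map, and the names features, score, rerank and train are all assumptions made for illustration, and the candidate list stands in for the output of the parsing step.

```python
from collections import Counter


def features(candidate):
    """Hypothetical feature map: count parent>child label pairs in a tree.

    A candidate is a nested tuple (label, child, ...); leaves are strings.
    """
    feats = Counter()
    label, *children = candidate
    for child in children:
        if isinstance(child, tuple):
            feats[f"{label}>{child[0]}"] += 1
            feats.update(features(child))  # add counts from the subtree
        else:
            feats[f"{label}>#text"] += 1
    return feats


def score(weights, candidate):
    """Linear score: dot product of the weights and the feature counts."""
    return sum(weights.get(f, 0.0) * v for f, v in features(candidate).items())


def rerank(weights, candidates):
    """Return the highest-scoring candidate (first one wins on ties)."""
    return max(candidates, key=lambda c: score(weights, c))


def train(examples, epochs=5):
    """Structured perceptron update over (candidates, gold_index) pairs."""
    weights = {}
    for _ in range(epochs):
        for candidates, gold_index in examples:
            gold = candidates[gold_index]
            predicted = rerank(weights, candidates)
            if predicted != gold:
                # Move weights toward the gold tree, away from the mistake.
                for f, v in features(gold).items():
                    weights[f] = weights.get(f, 0.0) + v
                for f, v in features(predicted).items():
                    weights[f] = weights.get(f, 0.0) - v
    return weights


# Toy data (invented): two candidate structures for the same input.
good = ("movie", ("title", "Alien"), ("year", "1979"))
bad = ("movie", ("title", "Alien"), ("title", "1979"))
weights = train([([bad, good], 1)])
best = rerank(weights, [bad, good])
```

In this sketch the weights change only when the top-ranked candidate differs from the manually produced target, which is the basic perceptron behaviour; the ranking algorithm used in the paper may differ in its features and update rule.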
Guillaume Wisniewski, Patrick Gallinari
Added 30 Oct 2010
Updated 30 Oct 2010
Type Conference
Year 2007
Where RIAO
Authors Guillaume Wisniewski, Patrick Gallinari