
AINA
2009
IEEE

Learning to Extract Content from News Webpages

We consider the problem of content extraction from online news webpages. To explore to what extent the syntactic markup and the visual structure of a webpage facilitate the extraction of its content, we compare two state-of-the-art classifiers as first instantiations of a general framework that allows for proper model comparison. To this end, we introduce the publicly available NEWS600 corpus, a set of 604 real-world news webpages that have been annotated with 30 semantic labels. An empirical analysis of the two models on this dataset shows that the inclusion of structural information is indeed advantageous.
Alex Spengler, Patrick Gallinari
Added 18 May 2010
Updated 18 May 2010
Type Conference
Year 2009
Where AINA
Authors Alex Spengler, Patrick Gallinari