Sciweavers

4 search results - page 1 / 1
» Cleaneval: a Competition for Cleaning Web Pages
LREC
2008
13 years 6 months ago
Cleaneval: a Competition for Cleaning Web Pages
Cleaneval is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus for linguistic and lang...
Marco Baroni, Francis Chantree, Adam Kilgarriff, S...
LREC
2008
13 years 6 months ago
A Lightweight and Efficient Tool for Cleaning Web Pages
Originally conceived as a "naive" baseline experiment using traditional n-gram language models as classifiers, the NCLEANER system has turned out to be a fast and lightw...
Stefan Evert
WWW
2009
ACM
14 years 5 months ago
Extracting article text from the web with maximum subsequence segmentation
Much of the information on the Web is found in articles from online news outlets, magazines, encyclopedias, review collections, and other sources. However, extracting this content...
Jeff Pasternack, Dan Roth
WSDM
2010
ACM
14 years 2 months ago
Boilerplate Detection using Shallow Text Features
In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, ma...
Christian Kohlschütter, Peter Fankhauser, Wol...