Automatic web news extraction using tree edit distance


The Web is the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository. Although several techniques have been developed for the problem of Web data extraction, their use is still not widespread, mostly because they require substantial human intervention and produce low-quality extraction results. In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach with several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.

Categories and Subject Descriptors H.3.m [Information Storage and Retrieval]: Mis...
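
The abstract does not spell out the tree structure analysis, so the following is only a minimal sketch of the general idea behind a tree edit distance, not the authors' algorithm. It assumes a simple top-down variant in which whole subtrees are inserted or deleted at a cost equal to their node count and a root relabel costs 1. The names `Node`, `size`, and `top_down_distance` are hypothetical and chosen for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A labeled, ordered tree node (for example, an HTML element)."""
    label: str
    children: list[Node] = field(default_factory=list)


def size(tree: Node) -> int:
    """Number of nodes in the subtree rooted at `tree`."""
    return 1 + sum(size(child) for child in tree.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Top-down edit distance between two ordered trees.

    Assumed cost model (illustrative): relabeling a node costs 1, and
    inserting or deleting a whole subtree costs its number of nodes.
    """
    cost = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)

    # d[i][j] = cost of turning the first i children of a
    # into the first j children of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                d[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return cost + d[m][n]


# Two toy page fragments that differ by one extra paragraph.
page_a = Node("div", [Node("h1"), Node("p")])
page_b = Node("div", [Node("h1"), Node("p"), Node("p")])
print(top_down_distance(page_a, page_b))
```

Under these assumptions, pages generated from the same template should yield small distances, which is the intuition behind using tree comparison to group pages and locate shared content regions.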

Davi de Castro Reis, Paulo Braz Golgher, Altigran


Internet Technology | Keywords Data Extraction | Largest Data Repository | Miscellaneous-Data Extraction | WWW 2004 |


» NET: A System for Extracting Web Data from Flat and Nested Data Records

» Learning Stochastic Tree Edit Distance

» Automatically Harvesting Katakana-English Term Pairs from Search Engine Query Logs

» Recognising Informative Web Page Blocks Using Visual Segmentation for Efficient Informatio...

» Thresher: Automating the Unwrapping of Semantic Content from the World Wide Web

» Homepage Live: Automatic Block Tracing for Web Personalization

Post Info

Added	22 Nov 2009
Updated	22 Nov 2009
Type	Conference
Year	2004
Where	WWW
Authors	Davi de Castro Reis, Paulo Braz Golgher, Altigran Soares da Silva, Alberto H. F. Laender
