AINA
2009
IEEE

CUTER: An Efficient Useful Text Extraction Mechanism

In this paper we present CUTER, a system that processes HTML pages to extract their useful text. The mechanism focuses on HTML pages that contain news articles from major portals and blogs. We define useful text as the body of the article, that is, the part that contains the news report. To extract the body, we parse the HTML page into its DOM model and apply a set of algorithms that clean and correct the HTML code, locate and characterize each node of the DOM model, and finally store the text from the nodes characterized as useful text nodes. CUTER is a subsystem of peRSSonal, a web tool that collects news articles from all over the world, processes them, and presents them to end users in a personalized manner. The role of CUTER is to feed peRSSonal with the body of the articles collected from major news portals and blogs. In this paper we present the basic algorithms and experimental results on the efficiency of ...
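
The pipeline described in the abstract (build a DOM, characterize each node, keep the text of useful nodes) can be illustrated with a minimal Python sketch. This is not CUTER's actual algorithm, which the abstract does not specify. It assumes a simple heuristic in which a `<p>` node counts as useful when it carries enough text and little of that text sits inside links. The names `Node`, `DomBuilder`, `measure`, `is_useful`, `subtree_text` and `extract_useful_text`, as well as both thresholds, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical thresholds, chosen only for illustration (not from the paper).
MIN_TEXT_CHARS = 80    # a useful node must carry at least this much text
MAX_LINK_RATIO = 0.3   # and at most this share of it inside <a> elements

SKIP_TAGS = {"script", "style", "noscript"}      # never counted as text
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # no closing tag
CANDIDATE_TAGS = {"p"}                           # nodes we try to characterize


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # mix of Node objects and str text fragments


class DomBuilder(HTMLParser):
    """Builds a minimal DOM-like tree and ignores stray closing tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            # Store text with whitespace runs collapsed to single spaces.
            self.current.children.append(" ".join(data.split()))


def measure(node, inside_link=False):
    """Return (total_chars, chars_inside_links) for the subtree of node."""
    if node.tag in SKIP_TAGS:
        return 0, 0
    inside_link = inside_link or node.tag == "a"
    total = linked = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            linked += len(child) if inside_link else 0
        else:
            sub_total, sub_linked = measure(child, inside_link)
            total += sub_total
            linked += sub_linked
    return total, linked


def is_useful(node):
    """Characterize a node: long enough and not dominated by link text."""
    total, linked = measure(node)
    return total >= MIN_TEXT_CHARS and linked / total <= MAX_LINK_RATIO


def subtree_text(node):
    """Concatenate the text of a subtree in document order."""
    if node.tag in SKIP_TAGS:
        return ""
    parts = [c if isinstance(c, str) else subtree_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract_useful_text(html):
    """Parse html, then collect the text of every node judged useful."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    useful = []

    def walk(node):
        if node.tag in CANDIDATE_TAGS and is_useful(node):
            useful.append(subtree_text(node))
            return
        for child in node.children:
            if isinstance(child, Node):
                walk(child)

    walk(builder.root)
    return "\n\n".join(useful)
```

Calling `extract_useful_text(page_html)` on a news page would, under these assumptions, return the long article paragraphs and drop navigation menus, short footers and script content. A real system would need a more robust HTML cleaning step and a richer node characterization than this link-density rule.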
George Adam, Christos Bouras, Vassilis Poulopoulos
Added 18 May 2010
Updated 18 May 2010
Type Conference
Year 2009
Where AINA
Authors George Adam, Christos Bouras, Vassilis Poulopoulos