LREC
2008


Automatic Extraction of Textual Elements from News Web Pages


Download www.lrec-conf.org

In this paper we present an algorithm for the automatic extraction of textual elements, namely titles and full text, associated with news stories in news web pages. We propose a supervised machine learning technique that uses a Support Vector Machine (SVM) classifier to extract the desired textual elements. The technique uses internal structural features of a web page without relying on the Document Object Model, to which many content authors fail to adhere. The classifier's features include the length of text and the percentage of hypertext, among others. The resulting classifier is nearly perfect on previously unseen news pages from different sites. The proposed technique is successfully employed in Alzoa.com, the largest Arabic news aggregator on the web.
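The two features the abstract names, text length and percentage of hypertext, can be illustrated with a minimal sketch. This is not the authors' implementation: the block segmentation rule (`BLOCK_TAGS`), the class `BlockFeatureParser`, and the function `block_features` are hypothetical names introduced here, and the abstract does not say how the paper segments a page. The sketch uses Python's streaming `html.parser` tokenizer rather than building a DOM tree, and it stops at feature vectors; an SVM would be trained on such vectors with labeled examples.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries. This segmentation rule is an assumption
# made for illustration; it is not taken from the paper.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "br"}


class BlockFeatureParser(HTMLParser):
    """Splits HTML into text blocks and counts how much text sits inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished (total_chars, link_chars) pairs
        self._total = 0        # characters of text in the current block
        self._linked = 0       # characters of that text inside <a> elements
        self._link_depth = 0   # > 0 while inside an <a> element

    def _flush(self):
        # Close the current block, skipping blocks that contain no text.
        if self._total > 0:
            self.blocks.append((self._total, self._linked))
        self._total = 0
        self._linked = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        n = len(data.strip())
        self._total += n
        if self._link_depth > 0:
            self._linked += n

    def close(self):
        super().close()
        self._flush()


def block_features(html):
    """Return one (text_length, hypertext_ratio) pair per non-empty block."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    return [(total, linked / total) for total, linked in parser.blocks]
```

On a page fragment such as a navigation `div` made entirely of links followed by a plain paragraph, the navigation block gets a hypertext ratio of 1.0 and the paragraph a ratio of 0.0, which suggests why such features can help separate story text from menus.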

Hossam Ibrahim, Kareem Darwish, Abdel-Rahim Madany


Education | LREC 2008 | Machine Learning Classification | Support Vector Machine | Textual Elements


Related Content

» A Layout-Independent Web News Article Contents Extraction Method Based on Relevance Analysi...

» Web-assisted annotation, semantic indexing and search of television and radio news

» Time-based contextualized-news browser (TCNB)

» Identifying Story and Preview Images in News Web Pages

» Automatic web news extraction using tree edit distance

» Automatic Identification of Temporal Information in Tourism Web Pages

» Using Visual Features for Fine-Grained Genre Classification of Web Pages

» Text Mining: Finding Nuggets in Mountains of Textual Data

» BlogBuster: A Tool for Extracting Corpora from the Blogosphere

Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	LREC
Authors	Hossam Ibrahim, Kareem Darwish, Abdel-Rahim Madany
