Web page title extraction and its application


This paper is concerned with the automatic extraction of titles from the bodies of HTML documents (web pages). Titles of HTML documents should be correctly defined in the title fields by the authors; in reality, however, they are often bogus. It is therefore advantageous to extract titles from HTML documents automatically. In this paper, we take a supervised machine learning approach to the problem. We first propose a specification of HTML titles, that is, a 'definition' of HTML titles. Next, we employ two learning methods to perform the task. In one method, we utilize features extracted from the DOM (Document Object Model) tree; in the other, we utilize features based on vision. We also combine the two methods to further enhance extraction accuracy. Our title extraction methods significantly outperform the baseline method of using the lines in the largest font size as the title (22.6%-37.4% improvements in terms of F1 score). As an application, we consider web page retrieval. We us...
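To make the baseline mentioned in the abstract concrete, here is a minimal sketch of a "largest font size" heuristic. It is not the authors' implementation. Rendered font sizes are not available without a browser, so the sketch assumes heading tags (h1 to h6) as a stand-in for font size; the names TAG_SIZE, LargestFontBaseline, and extract_title_baseline are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Assumption: heading level is used as a proxy for rendered font size.
TAG_SIZE = {"h1": 6, "h2": 5, "h3": 4, "h4": 3, "h5": 2, "h6": 1}
# Text inside these elements is never part of the page body.
SKIPPED = {"title", "script", "style"}


class LargestFontBaseline(HTMLParser):
    """Collects (size, text) pairs for every text run in the body."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, innermost last
        self.candidates = []  # (size, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or any(t in SKIPPED for t in self.stack):
            return
        # A text run takes the size of its largest enclosing heading.
        size = max((TAG_SIZE.get(t, 0) for t in self.stack), default=0)
        self.candidates.append((size, text))


def extract_title_baseline(html):
    """Return the text set in the largest (proxy) font size, or ''."""
    parser = LargestFontBaseline()
    parser.feed(html)
    parser.close()
    if not parser.candidates:
        return ""
    largest = max(size for size, _ in parser.candidates)
    return " ".join(t for size, t in parser.candidates if size == largest)
```

For example, on a page whose title field is bogus but whose body carries a single h1 heading, extract_title_baseline returns the heading text and ignores the title field. A learned extractor of the kind the abstract describes would replace the single font-size signal with many DOM and visual features.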

Yewei Xue, Yunhua Hu, Guomao Xin, Ruihua Song, Shu


HTML Documents | HTML Titles | IPM 2007 | Web Page |


» Syntactic Folding and its Application to the Information Extraction from Web Pages

» Extracting Structured Data from Web Pages

» Generating succinct titles for web URLs

» Is this a good title

» Generating Research Websites Using Summarisation Techniques

» Intelligent Content Based Title and Author Name Extraction from Formatted Documents

» Harnessing the wisdom of the crowds for accurate web page clipping

» Automatic Extraction of Textual Elements from News Web Pages

Post Info

Added	15 Dec 2010
Updated	15 Dec 2010
Type	Journal
Year	2007
Where	IPM
Authors	Yewei Xue, Yunhua Hu, Guomao Xin, Ruihua Song, Shuming Shi, Yunbo Cao, Chin-Yew Lin, Hang Li
