Automatic extraction of titles from general documents using machine learning

In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, methods have been proposed mainly for title extraction from research papers. It has not been clear whether automatic title extraction from general documents is possible. As a case study, we consider extraction from Office documents, including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint, respectively) and take them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly utilize formatting information such as font size as features in the models. It turns out that the use of formatting information can lead to quite accurate extraction from g...
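The pipeline described in the abstract (annotate titles, derive formatting features, train a model, extract with the trained model) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the excerpt does not say which learning algorithm or which exact features the authors use, so a plain perceptron stands in for the model, and the `Line` record, the bold, centered, and position features, and the toy documents are all hypothetical. Only font size is named in the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:
    """One text line of a document with hypothetical formatting attributes."""
    text: str
    font_size: float
    bold: bool
    centered: bool
    index: int  # position in the document, 0 = first line


def features(line: Line, max_font: float) -> List[float]:
    """Map a line to a feature vector built mostly from formatting."""
    return [
        line.font_size / max_font,      # font size relative to the largest in the document
        1.0 if line.bold else 0.0,
        1.0 if line.centered else 0.0,
        1.0 / (1 + line.index),         # earlier lines get a larger value
        1.0,                            # bias term
    ]


def score(weights: List[float], x: List[float]) -> float:
    return sum(w * xi for w, xi in zip(weights, x))


def train_perceptron(docs: List[Tuple[List[Line], int]], epochs: int = 20) -> List[float]:
    """Train on (lines, annotated_title_index) pairs with the perceptron rule.

    The perceptron is an illustrative stand-in, not the model from the paper.
    """
    weights = [0.0] * 5
    for _ in range(epochs):
        for lines, title_index in docs:
            max_font = max(l.font_size for l in lines)
            for l in lines:
                x = features(l, max_font)
                target = 1 if l.index == title_index else -1
                # Update only when the line is misclassified (or on the boundary).
                if target * score(weights, x) <= 0:
                    weights = [w + target * xi for w, xi in zip(weights, x)]
    return weights


def extract_title(lines: List[Line], weights: List[float]) -> str:
    """Return the text of the highest-scoring line."""
    max_font = max(l.font_size for l in lines)
    best = max(lines, key=lambda l: score(weights, features(l, max_font)))
    return best.text


# Toy annotated training documents (invented for this sketch).
TRAINING_DOCS = [
    ([Line("Quarterly Report", 24, True, True, 0),
      Line("Prepared by the finance team", 12, False, True, 1),
      Line("Revenue grew in the third quarter.", 11, False, False, 2)], 0),
    ([Line("Internal memo", 10, False, False, 0),
      Line("Project Plan", 20, True, False, 1),
      Line("This document outlines the schedule.", 11, False, False, 2)], 1),
]

weights = train_perceptron(TRAINING_DOCS)

unseen = [
    Line("Confidential", 10, False, False, 0),
    Line("Annual Budget Overview", 18, True, True, 1),
    Line("The following pages summarize spending.", 11, False, False, 2),
]
print(extract_title(unseen, weights))  # prints "Annual Budget Overview"
```

Note that the second training document and the unseen document both place the title on the second line, so a rule such as "take the first line" would fail there, while the formatting features still separate the title from the body text. With only two toy documents the learned weights lean almost entirely on the bold flag, which is a property of this small example and says nothing about the results reported in the paper.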

Yunhua Hu, Hang Li, Yunbo Cao, Dmitriy Meyerzon, Qinghua Zheng

Automatic Title Extraction | General Documents | JCDL 2005 | Title Extraction

» Web page title extraction and its application

» Automatic Extraction of Textual Elements from News Web Pages

» SciPlore Xtract Extracting Titles from Scientific PDF Documents by Analyzing Style Informa...

» Automatic Document Metadata Extraction Using Support Vector Machines

» Using titles and category names from editor-driven taxonomies for automatic evaluation

» Text categorization by boosting automatically extracted concepts

» Visual information extraction

» Clicked phrase document expansion for sponsored search ad retrieval

Post Info

Added	26 Jun 2010
Updated	26 Jun 2010
Type	Conference
Year	2005
Where	JCDL
Authors	Yunhua Hu, Hang Li, Yunbo Cao, Dmitriy Meyerzon, Qinghua Zheng
