SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size)


Download www.sciplore.org

Extracting titles from a PDF's full text is an important task in information retrieval to identify PDFs. Existing approaches apply complicated and expensive (in terms of computing power) machine learning algorithms such as Support Vector Machines and Conditional Random Fields. In this paper we present a simple rule-based heuristic, which considers style information (font size) to identify a PDF's title. In a first experiment we show that this heuristic delivers better results (77.9% accuracy) than a support vector machine by CiteSeer (69.4% accuracy) in an "academic search engine" scenario and better run times (8:19 minutes vs. 57:26 minutes).
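The font-size idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the first page's text has already been extracted into runs with a known font size, and the `Span` type, the `guess_title` function, and the `min_chars` and `tolerance` parameters are hypothetical names introduced only for this illustration. The rule shown (take the first contiguous run of text in the largest font) is one simple reading of "analyzing font size"; the actual tool may apply additional rules.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One run of first-page text with a uniform font size (hypothetical structure)."""
    text: str
    font_size: float


def guess_title(spans: List[Span], min_chars: int = 5,
                tolerance: float = 0.1) -> Optional[str]:
    """Return the first contiguous run of text set in the largest font size.

    Spans shorter than min_chars are ignored when determining the largest
    size, so that an oversized single character does not win (an assumption
    of this sketch, not a rule taken from the paper).
    """
    candidates = [s for s in spans if len(s.text.strip()) >= min_chars]
    if not candidates:
        return None
    largest = max(s.font_size for s in candidates)

    title_parts: List[str] = []
    for s in spans:  # spans are assumed to be in reading order
        is_largest = abs(s.font_size - largest) <= tolerance
        if is_largest and s.text.strip():
            title_parts.append(s.text.strip())
        elif title_parts:
            break  # the run of largest-font text has ended
    return " ".join(title_parts) or None


# Usage with made-up first-page data:
page = [
    Span("Proceedings header", 9.0),
    Span("SciPlore Xtract: Extracting Titles", 18.0),
    Span("from Scientific PDF Documents", 18.0),
    Span("Jöran Beel, Bela Gipp", 11.0),
    Span("Abstract", 12.0),
]
print(guess_title(page))
```

A heuristic like this needs only one pass over the first page, which is consistent with the run-time advantage the abstract reports over trained classifiers.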

Jöran Beel, Bela Gipp, Ammar Shaker, Nick Friedrich


Conditional Random Fields | Education | ERCIMDL 2010 | Machine | Support Vector Machine


Post Info

Added	02 Mar 2011
Updated	02 Mar 2011
Type	Journal
Year	2010
Where	ERCIMDL
Authors	Jöran Beel, Bela Gipp, Ammar Shaker, Nick Friedrich


Sciweavers
