ICDAR
2005
IEEE

Document Understanding System Using Stochastic Context-Free Grammars

We present a document understanding system in which the arrangement of lines of text and block separators within a document is modeled by stochastic context-free grammars. A grammar corresponds to a document genre; our system may be adapted to a new genre simply by replacing the input grammar. The system incorporates an optical character recognition system that outputs characters, their positions, and their font sizes. These features are combined to form a document representation consisting of lines of text and separators. Lines of text are labeled as tokens using regular expression matching. The maximum likelihood parse of this stream of tokens and separators yields a functional labeling of the document lines. We describe business card and business letter applications.
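The last two steps of the abstract (regular-expression token labeling, then a maximum likelihood parse) can be illustrated with a small sketch. This is not the authors' implementation: the token patterns, the toy business-card grammar, its probabilities, and the names `label_line` and `viterbi_parse` are all assumptions made for illustration, and separators are left out for brevity. The parse uses a standard Viterbi variant of the CYK algorithm over a grammar in Chomsky normal form.

```python
import math
import re

# Hypothetical token patterns for a business-card genre (not from the paper).
TOKEN_PATTERNS = [
    ("PHONE", re.compile(r"\+?[\d\s().-]{7,}")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")),
]

def label_line(line):
    """Label a line of text with the first pattern that matches it fully."""
    for label, pattern in TOKEN_PATTERNS:
        if pattern.fullmatch(line.strip()):
            return label
    return "TEXT"  # fallback token for any other line

# Toy stochastic grammar in Chomsky normal form (assumed, not from the paper).
START = "CARD"
LEXICAL = {  # (nonterminal, token) -> probability
    ("NAME", "TEXT"): 1.0,
    ("TITLE", "TEXT"): 1.0,
    ("PHONE_LINE", "PHONE"): 1.0,
    ("EMAIL_LINE", "EMAIL"): 1.0,
}
BINARY = {  # (parent, left child, right child) -> probability
    ("CARD", "NAME", "BODY"): 1.0,
    ("BODY", "TITLE", "CONTACT"): 0.7,
    ("BODY", "PHONE_LINE", "EMAIL_LINE"): 0.3,
    ("CONTACT", "PHONE_LINE", "EMAIL_LINE"): 0.6,
    ("CONTACT", "EMAIL_LINE", "PHONE_LINE"): 0.4,
}

def viterbi_parse(tokens):
    """Return (functional labels, parse probability), or None if no parse."""
    n = len(tokens)
    if n == 0:
        return None
    # best[(i, j)][A] = (log probability, backpointer) for tokens[i:j]
    best = {}
    for i, tok in enumerate(tokens):
        best[(i, i + 1)] = {
            lhs: (math.log(p), None)
            for (lhs, term), p in LEXICAL.items() if term == tok
        }
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):
                left, right = best[(i, k)], best[(k, j)]
                for (lhs, b, c), p in BINARY.items():
                    if b in left and c in right:
                        score = math.log(p) + left[b][0] + right[c][0]
                        if lhs not in cell or score > cell[lhs][0]:
                            cell[lhs] = (score, (k, b, c))
            best[(i, j)] = cell
    if START not in best[(0, n)]:
        return None
    labels = [None] * n

    def walk(symbol, i, j):
        # Follow backpointers; preterminals give each line its label.
        back = best[(i, j)][symbol][1]
        if back is None:
            labels[i] = symbol
        else:
            k, b, c = back
            walk(b, i, k)
            walk(c, k, j)

    walk(START, 0, n)
    return labels, math.exp(best[(0, n)][START][0])

lines = ["Jane Doe", "Research Scientist",
         "+1 555 010 0199", "jane.doe@example.com"]
tokens = [label_line(line) for line in lines]
result = viterbi_parse(tokens)
```

In this sketch the two `TEXT` lines are indistinguishable at the token level; it is the grammar that assigns them the distinct functional labels `NAME` and `TITLE`, which is the role the abstract gives to the parse. Adapting to another genre would mean swapping the `LEXICAL` and `BINARY` tables, in the spirit of "replacing the input grammar".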
John C. Handley, Anoop M. Namboodiri, Richard Zanibbi
Added 24 Jun 2010
Updated 24 Jun 2010
Type Conference
Year 2005
Where ICDAR
Authors John C. Handley, Anoop M. Namboodiri, Richard Zanibbi