We introduce a new method to improve web site text content by identifying the most relevant free text in the web pages. In order to understand the variations in web page text, we c...
Text clustering is potentially very useful for exploration of text sets that are too large to study manually. The success of such a tool depends on whether the results can be expl...
We describe a segmentation method and associated file format for storing images of color documents. We separate each page of the document into three layers, containing the backgro...
Daniel P. Huttenlocher, Pedro F. Felzenszwalb, Wil...
Author identification is a text categorization task with applications in intelligence, criminal law, computer forensics, etc. Usually, in such cases there is a shortage of training t...
We present a model for complex documents possibly consisting of a hierarchically structured set of images or texts. Documents are represented both at the form level (as s...
Carlo Meghini, Fabrizio Sebastiani, Umberto Stracc...