We describe a methodology for retrieving document images from large, extremely diverse collections. First, we perform content extraction, that is, the location and measurement of reg...
Large-volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents, which increases processing and sto...
The success of a software project is largely dependent upon the quality of the Software Requirements Specification (SRS) document, which serves as a medium to communicate user req...
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching sig...
Martin Theobald, Jonathan Siddharth, Andreas Paepc...
This paper describes experiments in the automatic construction of lexicons that would be useful in searching large document collections for text fragments that address a specific ...