An ecological-cognitive framework of analysis and a model-tracing architecture are presented and used in the analysis of data recorded from users browsing a large document collect...
As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work i...
With the ever-increasing growth of the Internet, numerous copies of documents become serious problem for search engine, opinion mining and many other web applications. Since parti...
The traditional retrieval models based on term matching are not effective in collections of degraded documents (output of OCR or ASR systems for instance). This paper presents a n...
A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is r...