As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retri...
The explosive growth of multimedia data poses serious challenges to data storage, management and search. Efficient near-duplicate detection is one of the required technologies for...
Code duplication, plausibly caused by copying source code and slightly modifying it, is often observed in large systems. Clone detection and documentation have been investigated b...
Magdalena Balazinska, Ettore Merlo, Michel Dagenai...
A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an init...
Jakob Uszkoreit, Jay Ponte, Ashok C. Popat, Moshe ...
As massive document repositories and knowledge management systems continue to expand, in proprietary environments as well as on the Web, the need for duplicate detection becomes i...