CIKM 2011, Springer

Partial duplicate detection for large book collections

A framework is presented for discovering partial duplicates in large collections of scanned books containing optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words, in the order they appear in the text, that occur only once in the book. These words are referred to as “unique words,” and they constitute a small percentage of all the words in a typical book. Together with the order information, the set of unique words provides a compact representation that is highly descriptive of the content and the flow of ideas in the book. By aligning the sequences of unique words from two books using the longest common subsequence (LCS), one can discover whether the two books are duplicates. Experiments on several datasets show that DUPNIQ is more accurate than traditional duplicate detection methods such as shingling, and is fast. On a collection of 100K scanned English books, DUPNIQ detects partial duplicates in 30 min using 350 cores and has prec...
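The representation and matching steps described in the abstract can be sketched in a few lines: extract the words that occur exactly once in each book (preserving their order), then align the two resulting sequences with a longest-common-subsequence dynamic program. This is a minimal illustration only, not the authors' DUPNIQ implementation; the function names, the whitespace tokenization, and the normalized score based on the shorter sequence are assumptions for the sketch.

```python
from collections import Counter

def unique_words(text):
    # Words that appear exactly once in the book, kept in order of appearance.
    words = text.lower().split()
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]

def lcs_length(a, b):
    # Classic O(len(a) * len(b)) dynamic program for the length of the
    # longest common subsequence, keeping only one DP row in memory.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def duplicate_score(text1, text2):
    # Fraction of the shorter unique-word sequence that aligns with the
    # other book; near 1.0 suggests a (partial) duplicate.
    u1, u2 = unique_words(text1), unique_words(text2)
    if not u1 or not u2:
        return 0.0
    return lcs_length(u1, u2) / min(len(u1), len(u2))
```

Because OCR errors rarely corrupt the same words in both copies, the alignment tolerates noise: corrupted unique words simply drop out of the common subsequence while the surviving ones still match in order.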
Ismet Zeki Yalniz, Ethem F. Can, R. Manmatha