Scaling up duplicate detection in graph data

13 years 6 months ago

Duplicate detection identifies different representations of the same real-world object in a database. Recent research has considered using relationships among object representations to improve duplicate detection. In the general case, where relationships form a graph, research has mainly focused on the quality (effectiveness) of duplicate detection. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. We scale up duplicate detection in graph data (DDG) to large amounts of data with the support of a relational database system. We first generalize the DDG process and then show how to scale DDG in space (the amount of data processed with limited main memory) and in time. Finally, we explore how complex similarity computations can be performed efficiently. Experiments on data an order of magnitude larger than the data considered so far in DDG clearly show that our methods scale to large amounts of data. Categories and Subject Descriptors H.4 [In...

Melanie Herschel, Felix Naumann
