GESS: a scalable similarity-join algorithm for mining large data sets in high dimensional spaces

14 years 9 months ago


The similarity join is an important operation for mining high-dimensional feature spaces. Given two data sets, the similarity join computes all tuples (x, y) that are within a distance ε of each other. One of the most efficient algorithms for processing similarity joins is the Multidimensional-Spatial Join (MSJ) by Koudas and Sevcik. In our previous work, pursued for the two-dimensional case, we found, however, that MSJ has several performance shortcomings in terms of CPU and I/O cost as well as memory requirements. Therefore, MSJ is not generally applicable to high-dimensional data. In this paper, we propose a new algorithm named Generic External Space Sweep (GESS). GESS introduces a modest rate of data replication to reduce the number of expensive distance computations. We present a new cost model for replication, an I/O model, and an inexpensive method for duplicate removal. The principal component of our algorithm is a highly flexible replication engine. Our analytical model predicts a tremen...
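To make the join definition concrete, the following is a minimal sketch of a naive nested-loop similarity join. It is not GESS or MSJ, only the quadratic baseline that such algorithms aim to beat by cutting down the number of distance computations. The function name `similarity_join`, the parameter `eps` (the distance threshold), the choice of Euclidean distance, and the inclusive comparison are all assumptions made for illustration; the abstract does not specify them.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def similarity_join(r, s, eps):
    """Return all pairs (x, y), x in r and y in s, with dist(x, y) <= eps.

    Naive baseline: compares every point of r with every point of s,
    so it performs len(r) * len(s) distance computations.
    Assumption: Euclidean metric and an inclusive threshold.
    """
    return [(x, y) for x, y in product(r, s) if dist(x, y) <= eps]


# Two small, made-up two-dimensional data sets.
r = [(0.0, 0.0), (1.0, 1.0)]
s = [(0.1, 0.0), (5.0, 5.0), (1.0, 1.2)]

pairs = similarity_join(r, s, eps=0.25)
for x, y in pairs:
    print(x, y)
```

Every point of one set is compared with every point of the other here, which is exactly the cost that becomes prohibitive for large, high-dimensional inputs.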

Jens-Peter Dittrich, Bernhard Seeger
