
JMLR
2007

Distances between Data Sets Based on Summary Statistics

The concepts of similarity and distance are crucial in data mining. We consider the problem of defining the distance between two data sets by comparing summary statistics computed from the data sets. The initial definition of our distance is based on geometrical notions of certain sets of distributions. We show that this distance can be computed in cubic time and that it has several intuitive properties. We further show that it is the unique Mahalanobis distance satisfying certain assumptions. For binary data sets, we demonstrate that the distance can be represented naturally by certain parity functions and that it can be evaluated in linear time. Our empirical tests with real-world data show that the distance works well.
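To make the binary case concrete, the following is a minimal sketch, not the paper's implementation. It assumes the summary statistics are parity frequencies (the fraction of rows in which a chosen set of columns contains an odd number of ones) and that the Mahalanobis covariance is taken under the uniform distribution over binary vectors. Under that assumption, distinct nonempty parity functions are uncorrelated and each has variance 1/4, so the Mahalanobis distance reduces to twice the Euclidean distance between the frequency vectors. The function names, the choice of column subsets, and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def parity_frequencies(rows, subsets):
    """For each column subset, return the fraction of rows whose bits
    restricted to that subset sum to an odd number."""
    n = len(rows)
    return [
        sum(sum(row[i] for i in subset) % 2 for row in rows) / n
        for subset in subsets
    ]


def parity_distance(data_a, data_b, subsets):
    """Mahalanobis-type distance between two binary data sets.

    Assumption: the covariance is (1/4) * identity, which holds for
    distinct nonempty parity functions under the uniform distribution,
    so the distance is 2 * Euclidean distance of the frequency vectors.
    """
    freq_a = parity_frequencies(data_a, subsets)
    freq_b = parity_frequencies(data_b, subsets)
    return 2.0 * sqrt(sum((x - y) ** 2 for x, y in zip(freq_a, freq_b)))


# Hypothetical toy data: rows are binary tuples over three columns.
data_a = [(0, 0, 1), (1, 0, 1), (1, 1, 0), (0, 1, 1)]
data_b = [(1, 1, 1), (1, 0, 0), (1, 1, 0), (1, 1, 1)]

# Hypothetical feature choice: all column subsets of size 1 and 2.
subsets = [s for k in (1, 2) for s in combinations(range(3), k)]

print(parity_distance(data_a, data_b, subsets))
```

For a fixed family of column subsets, each data set is scanned once, so the cost grows linearly with the number of rows.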
Nikolaj Tatti
Type Journal
Year 2007
Where JMLR
Authors Nikolaj Tatti