Characterising the difference

16 years 8 months ago


Characterising the differences between two databases is a frequently occurring problem in Data Mining. Detecting change over time is a prime example; comparing the databases of two branches is another. The key problem is to discover the patterns that describe the difference. Emerging patterns provide only a partial answer to this question. In previous work, we showed that the data distribution can be captured in a pattern-based model using compression [12]. Here, we extend this approach to define a generic dissimilarity measure on databases. Moreover, we show that this approach can identify those patterns that characterise the differences between two distributions. Experimental results show that our method provides a well-founded way to independently measure database dissimilarity that allows for thorough inspection of the actual differences. This illustrates the use of our approach in real-world data mining.

Categories and Subject Descriptors: H.2.8. Data Mining; I.5.4. Similarity Mea...
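To make the idea of a compression-based dissimilarity concrete, here is a minimal sketch. It is not the authors' method: it swaps the pattern-based model mentioned in the abstract for a much simpler stand-in, a Laplace-smoothed code over single items. The function names (`item_model`, `encoded_size`, `dissimilarity`) and the normalised "worst relative blow-up" formula are assumptions made for illustration only. The intuition it shows is that a database compresses best with a model induced from itself, so the extra bits needed when using the other database's model indicate how different the two are.

```python
import math
from collections import Counter


def item_model(db, alphabet):
    """Laplace-smoothed item probabilities estimated from a database.

    Stand-in for a pattern-based model: only single items are modelled.
    """
    counts = Counter(item for transaction in db for item in transaction)
    total = sum(counts.values()) + len(alphabet)
    return {item: (counts[item] + 1) / total for item in alphabet}


def encoded_size(db, model):
    """Bits needed to encode db with Shannon code lengths taken from model."""
    return sum(-math.log2(model[item]) for transaction in db for item in transaction)


def dissimilarity(db_x, db_y):
    """Hypothetical measure: the larger relative increase in encoded size
    when a database is compressed with the other database's model."""
    alphabet = {item for db in (db_x, db_y) for t in db for item in t}
    model_x = item_model(db_x, alphabet)
    model_y = item_model(db_y, alphabet)
    own_x = encoded_size(db_x, model_x)
    own_y = encoded_size(db_y, model_y)
    cross_x = encoded_size(db_x, model_y)  # x compressed with y's model
    cross_y = encoded_size(db_y, model_x)  # y compressed with x's model
    return max((cross_x - own_x) / own_x, (cross_y - own_y) / own_y)


if __name__ == "__main__":
    # Toy transaction databases (made-up data, illustration only).
    branch_a = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"]]
    branch_b = [["bread", "milk"], ["bread", "milk"], ["bread", "butter"]]
    branch_c = [["beer", "chips"], ["beer", "chips"], ["beer"]]
    print(dissimilarity(branch_a, branch_b))
    print(dissimilarity(branch_a, branch_c))
```

In this sketch, two identical databases score exactly zero, and the score grows as the item distributions drift apart. Because the stand-in model covers only single items, it cannot point to the multi-item patterns that characterise a difference, which is what the pattern-based approach in the abstract is aimed at.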

Jilles Vreeken, Matthijs van Leeuwen, Arno Siebes
