On Finding Similar Items in a Stream of Transactions

14 years 11 months ago


While there has been a lot of work on finding frequent itemsets in transaction data streams, none of it solves the problem of finding similar pairs according to standard similarity measures. This paper is a first attempt at dealing with this, arguably more important, problem. We start out with a negative result that also explains the lack of theoretical upper bounds on the space usage of data mining algorithms for finding frequent itemsets: any algorithm that (even only approximately and with a chance of error) finds the most frequent k-itemset must use space $\Omega(\min\{mb, n^k, (mb/\varphi)^k\})$ bits, where mb is the number of items in the stream so far, n is the number of distinct items and $\varphi$ is a support threshold. To achieve any non-trivial space upper bound we must thus abandon a worst-case assumption on the data stream. We work under the model that the transactions come in random order, and show that surprisingly, not only is small-space similarity mining possible for the most common similari...
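To make the problem statement concrete, here is a minimal sketch of what "finding similar pairs" means on a small batch of transactions. It is an exact, store-everything baseline and not the paper's algorithm: it keeps a counter for every co-occurring pair, which is exactly the space cost a small-space streaming method would try to avoid. The choice of Jaccard and cosine as the similarity measures, the function name `pair_similarities`, and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_similarities(transactions):
    """Return {(a, b): (jaccard, cosine)} for every co-occurring item pair.

    Illustrative baseline only: space grows with the number of distinct
    pairs seen, so this does not satisfy any small-space guarantee.
    """
    item_count = Counter()  # number of transactions containing each item
    pair_count = Counter()  # number of transactions containing both items
    for t in transactions:
        items = sorted(set(t))  # sorted so each pair has one canonical key
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    result = {}
    for (a, b), both in pair_count.items():
        jaccard = both / (item_count[a] + item_count[b] - both)
        cosine = both / (item_count[a] * item_count[b]) ** 0.5
        result[(a, b)] = (jaccard, cosine)
    return result


# Hypothetical sample data, made up for this sketch.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"milk"},
]
sims = pair_similarities(transactions)
for pair, (jaccard, cosine) in sorted(sims.items()):
    print(pair, round(jaccard, 3), round(cosine, 3))
```

Here item-pair similarity is computed from support counts alone, so a pair can be highly similar without either item being frequent, which is one way to read the abstract's distinction between frequent itemsets and similar pairs.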

Andrea Campagna, Rasmus Pagh
