
PODS
2008
ACM

The power of two min-hashes for similarity search among hierarchical data objects

In this study, we propose sketching algorithms for computing similarities between hierarchical data objects. Specifically, we look at data objects represented as leaf-labeled trees, in which a set of elements at the leaves is organized in a hierarchy. Such representations are richer alternatives to a set. For example, a document can be represented as a hierarchy of sets in which chapters, sections, and paragraphs form different levels of the hierarchy. Such a representation is richer than viewing the document simply as a set of words. We measure the distance between trees using the best possible superimposition, that is, the one that minimizes the number of mismatched leaf labels. Our distance measure is equivalent to an Earth Mover's Distance measure, since leaf-labeled trees of height one can be viewed as sets, and this view extends recursively to trees of larger height by treating them as sets of sets. We compute sketches of arbitrary weighted trees and analyze them in the context of locality-sen...
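For orientation, the base case mentioned in the abstract (a tree of height one, viewed as a plain set of leaf labels) can be illustrated with ordinary min-hashing. The sketch below is a minimal illustration of that standard technique only; it is not the authors' two-min-hash algorithm for trees, and the function names, the seeded BLAKE2b hash family, and the sketch length are all assumptions made for this example.

```python
import hashlib


def _hash(seed: int, label: str) -> int:
    """Hypothetical seeded hash: maps (seed, label) to a 64-bit integer."""
    data = f"{seed}:{label}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(labels, num_hashes=128):
    """Standard min-hash sketch of a set of leaf labels (a height-one tree).

    For each of num_hashes seeded hash functions, keep the minimum hash
    value over the set. This is plain min-hashing, assumed here as the
    base case; it is not the paper's sketch for deeper trees.
    """
    label_set = set(labels)
    if not label_set:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(seed, x) for x in label_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of positions where the two sketches agree.

    For min-hash sketches this fraction estimates the Jaccard similarity
    |A ∩ B| / |A ∪ B| of the underlying sets.
    """
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two overlapping sets of leaf labels (made-up data for illustration).
    doc_a = {f"w{i}" for i in range(100)}
    doc_b = {f"w{i}" for i in range(50, 150)}
    estimate = estimate_jaccard(minhash_sketch(doc_a), minhash_sketch(doc_b))
    print("estimated:", estimate, "exact:", exact_jaccard(doc_a, doc_b))
```

Longer sketches reduce the variance of the estimate at the cost of more hash evaluations per set. Extending this to sets of sets, as the abstract describes, would need a sketch for each level of the hierarchy, which this example does not attempt.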
Sreenivas Gollapudi, Rina Panigrahy
Added 08 Dec 2009
Updated 08 Dec 2009
Type Conference
Year 2008
Where PODS
Authors Sreenivas Gollapudi, Rina Panigrahy