Fast approximate hierarchical clustering using similarity heuristics

Background: Agglomerative hierarchical clustering (AHC) is a common unsupervised data analysis technique used in several biological applications. Standard AHC methods require all pairwise distances between data objects to be known. With ever-increasing data sizes, this quadratic complexity poses problems that cannot be overcome by simply waiting for faster computers. Results: We propose an approximate AHC algorithm, HappieClust, which can output a biologically meaningful clustering of a large dataset more than an order of magnitude faster than full AHC algorithms. The key to the algorithm is to limit the number of calculated pairwise distances to a carefully chosen subset of all possible distances. We choose distances using a similarity heuristic based on a small set of pivot objects. The heuristic efficiently finds pairs of similar objects, and these help to mimic the greedy choices of full AHC. The quality of approximate AHC, as compared to full AHC, is studied with three measures. The...
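The pivot idea in the abstract can be illustrated with a minimal sketch. This is not the HappieClust implementation: the function names, the `eps` threshold, the random choice of pivots, and the use of Euclidean distance are all assumptions made for illustration. The sketch relies only on the triangle inequality, which gives |d(x, p) - d(y, p)| <= d(x, y) for any pivot p, so two objects whose distances to some pivot differ by more than `eps` cannot be within `eps` of each other.

```python
import math
import random
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def candidate_pairs(points, num_pivots=2, eps=1.0, seed=0):
    """Return index pairs (i, j) that may lie within eps of each other.

    Hypothetical helper: pivots are sampled at random from the data, and
    a pair is kept when its pivot-based lower bound does not exceed eps.
    """
    rng = random.Random(seed)
    pivots = rng.sample(points, num_pivots)
    # Distance from every object to every pivot: n * num_pivots distances.
    profiles = [[euclidean(p, q) for q in pivots] for p in points]
    pairs = []
    # For brevity this loop still visits every pair, although each check is
    # a cheap comparison of precomputed numbers. A practical implementation
    # would index the profiles to avoid enumerating all pairs.
    for i, j in combinations(range(len(points)), 2):
        lower_bound = max(abs(a - b) for a, b in zip(profiles[i], profiles[j]))
        if lower_bound <= eps:
            pairs.append((i, j))
    return pairs


def sparse_distances(points, pairs):
    """Compute true distances only for the selected candidate pairs."""
    return {(i, j): euclidean(points[i], points[j]) for i, j in pairs}


points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
pairs = candidate_pairs(points, num_pivots=2, eps=0.5)
dists = sparse_distances(points, pairs)
```

With these four points, only the two close pairs, (0, 1) and (2, 3), survive the filter, so two true distances are computed instead of six. An approximate AHC would then merge clusters using this sparse set of distances rather than the full distance matrix.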
Meelis Kull, Jaak Vilo
Added 08 Dec 2010
Updated 08 Dec 2010
Type Journal
Year 2008