We present an efficient algorithm called the Quadtree Heuristic for identifying a list of similar terms for each unique term in a large document collection. Term similarity is de...
This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to de...
Background: With the increased availability of high throughput data, such as DNA microarray data, researchers are capable of producing large amounts of biological data. During the...
Topic hierarchies are very useful for managing, searching and browsing large repositories of text documents. The hierarchical clustering methods are used to support the constructi...
Abstract. We show that eigenvector decomposition can be used to extract a term taxonomy from a given collection of text documents. So far, methods based on eigenvector decompositio...
Holger Bast, Georges Dupret, Debapriyo Majumdar, B...