
ITCC
2003
IEEE

A Method for Calculating Term Similarity on Large Document Collections

We present an efficient algorithm, the Quadtree Heuristic, for identifying a list of similar terms for each unique term in a large document collection. Term similarity is defined using the Expected Mutual Information Measure (EMIM). Since the aim of the similarity lists is to improve information retrieval (IR), we report an experiment comparing the performance of an IR engine designed to use the similarity lists. Two methods were used to generate the lists: a brute-force technique and the Quadtree Heuristic. The performance of the list generated by the Quadtree Heuristic was commensurate with that of the brute-force list.

1 Background

To facilitate the retrieval of OCR documents, the Information Science Research Institute has begun construction of a retrieval system called Hairetes [7]. Hairetes enhances a traditional retrieval system by incorporating the technique of Retrieval by General Logical Imaging (RbGLI) as developed by Crestani and Van Rijsber...
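
The abstract names EMIM as the similarity measure and a brute-force technique as the baseline, but the excerpt gives neither in detail. As a rough illustration only, the sketch below assumes the standard four-cell form of EMIM over term presence and absence in documents, EMIM(a, b) = sum over x, y in {0, 1} of P(x, y) log(P(x, y) / (P(x) P(y))), estimated from document counts. The function names `emim` and `similarity_lists`, the `top_k` parameter, and the toy input are hypothetical and are not taken from the paper. The Quadtree Heuristic itself is not described in this excerpt and is not attempted here.

```python
import math
from itertools import combinations


def emim(n_docs, df_a, df_b, df_ab):
    """EMIM of two terms from document counts (assumed four-cell form).

    n_docs: number of documents; df_a, df_b: documents containing each term;
    df_ab: documents containing both terms.
    """
    # (joint count, marginal count for a's state, marginal count for b's state)
    cells = [
        (df_ab, df_a, df_b),                                           # both present
        (df_a - df_ab, df_a, n_docs - df_b),                           # only a present
        (df_b - df_ab, n_docs - df_a, df_b),                           # only b present
        (n_docs - df_a - df_b + df_ab, n_docs - df_a, n_docs - df_b),  # neither present
    ]
    total = 0.0
    for joint, marg_a, marg_b in cells:
        if joint > 0:  # empty cells contribute nothing (0 * log 0 is taken as 0)
            total += (joint / n_docs) * math.log(joint * n_docs / (marg_a * marg_b))
    return total


def similarity_lists(docs, top_k=3):
    """Brute-force sketch: score every term pair, keep the top_k per term."""
    n_docs = len(docs)
    postings = {}  # term -> set of ids of documents containing it
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            postings.setdefault(term, set()).add(doc_id)

    scored = {term: [] for term in postings}
    for a, b in combinations(sorted(postings), 2):
        both = len(postings[a] & postings[b])
        score = emim(n_docs, len(postings[a]), len(postings[b]), both)
        scored[a].append((score, b))
        scored[b].append((score, a))

    # Highest score first; ties are broken alphabetically for determinism.
    return {
        term: [other for _, other in sorted(pairs, key=lambda p: (-p[0], p[1]))[:top_k]]
        for term, pairs in scored.items()
    }
```

This baseline scores every pair of unique terms, so its cost grows quadratically with vocabulary size, which is presumably the expense a faster heuristic is meant to avoid. Note also that the four-cell form scores strong negative association as highly as strong positive association; whether the paper uses this form or a variant is not stated in the excerpt.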
Wolfgang W. Bein, Jeffrey S. Coombs, Kazem Taghva
Added 04 Jul 2010
Updated 04 Jul 2010
Type Conference
Year 2003
Where ITCC