A Method for Calculating Term Similarity on Large Document Collections

15 years 4 months ago


We present an efficient algorithm called the Quadtree Heuristic for identifying a list of similar terms for each unique term in a large document collection. Term similarity is defined using the Expected Mutual Information Measure (EMIM). Since our aim for defining the similarity lists is to improve information retrieval (IR), we present the outcome of an experiment comparing the performance of an IR engine designed to use the similarity lists. Two methods were used to generate similarity lists: a brute-force technique and the Quadtree Heuristic. The performance of the list generated by the Quadtree Heuristic was commensurate with the brute force list.

1 Background

To facilitate the retrieval of OCR documents, the Information Science Research Institute has begun construction of a retrieval system called Hairetes [7]. Hairetes enhances a traditional retrieval system by incorporating the technique of Retrieval by General Logical Imaging (RbGLI) as developed by Crestani and Van Rijsber...
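
To make the EMIM-based similarity lists mentioned in the abstract concrete, here is a minimal Python sketch of a brute-force baseline. It is not the Quadtree Heuristic, whose details are not given in this excerpt. It assumes the common textbook form of EMIM (mutual information between the binary "term occurs in document" indicators of two terms, summed over all four presence/absence cells); the paper's exact variant may differ. The names `emim` and `similarity_lists` and the parameter `k` are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def emim(n, n_x, n_y, n_xy):
    """Mutual information (natural log) between the occurrence indicators
    of two terms, given document counts.

    n    : total number of documents
    n_x  : documents containing term x
    n_y  : documents containing term y
    n_xy : documents containing both terms
    Assumed form; the source paper may use a different variant.
    """
    # Each cell: (joint count, marginal count for x-state, marginal for y-state)
    cells = [
        (n_xy, n_x, n_y),                          # x present, y present
        (n_x - n_xy, n_x, n - n_y),                # x present, y absent
        (n_y - n_xy, n - n_x, n_y),                # x absent,  y present
        (n - n_x - n_y + n_xy, n - n_x, n - n_y),  # x absent,  y absent
    ]
    total = 0.0
    for joint, marg_x, marg_y in cells:
        if joint > 0:  # empty cells contribute nothing (0 * log 0 := 0)
            total += (joint / n) * math.log((joint * n) / (marg_x * marg_y))
    return total


def similarity_lists(docs, k=3):
    """Brute force: score every pair of terms and keep the top k per term.

    docs is a list of token lists. The pairwise loop is quadratic in the
    vocabulary size, which is the cost a heuristic would aim to avoid.
    """
    n = len(docs)
    postings = {}  # term -> set of document indices containing it
    for i, tokens in enumerate(docs):
        for term in set(tokens):
            postings.setdefault(term, set()).add(i)

    scores = {term: [] for term in postings}
    for a, b in combinations(sorted(postings), 2):
        both = len(postings[a] & postings[b])
        s = emim(n, len(postings[a]), len(postings[b]), both)
        scores[a].append((s, b))
        scores[b].append((s, a))

    # Sort by descending score, breaking ties alphabetically.
    return {
        term: [u for _, u in sorted(pairs, key=lambda p: (-p[0], p[1]))[:k]]
        for term, pairs in scores.items()
    }


if __name__ == "__main__":
    toy_docs = [
        ["ocr", "scan"],
        ["ocr", "scan", "index"],
        ["index"],
        ["query"],
    ]
    print(similarity_lists(toy_docs, k=2))
```

Note that in this assumed form the score measures statistical dependence in either direction, so two terms that never co-occur can also score highly; a production system would likely need to account for that.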

Wolfgang W. Bein, Jeffrey S. Coombs, Kazem Taghva
