Sciweavers

290 search results - page 49 / 58
» Document normalization revisited
CIKM
2011
Springer
14 years 21 days ago
Probabilistic near-duplicate detection using simhash
This paper offers a novel look at using a dimensionality-reduction technique called simhash [8] to detect similar document pairs in large-scale collections. We show that this algo...
Sadhan Sood, Dmitri Loguinov
CACM
2006
15 years 24 days ago
Infoglut
whose titles and abstracts sound very interesting, the pile of unread reports continues to grow on the table in my office." (How quaint the terminology: mail and electronic me...
Peter J. Denning
SIGIR
2002
ACM
15 years 11 days ago
Empirical studies in strategies for Arabic retrieval
This work evaluates a few search strategies for Arabic monolingual and cross-lingual retrieval, using the TREC Arabic corpus as the test-bed. The release by NIST in 2001 of an Ara...
Jinxi Xu, Alexander Fraser, Ralph M. Weischedel
SIGIR
2010
ACM
15 years 4 months ago
Estimation of statistical translation models based on mutual information for ad hoc information retrieval
As a principled approach to capturing semantic relations of words in information retrieval, statistical translation models have been shown to outperform simple document language m...
Maryam Karimzadehgan, ChengXiang Zhai
DRR
2003
15 years 2 months ago
Correcting OCR text by association with historical datasets
The Medical Article Records System (MARS) developed by the Lister Hill National Center for Biomedical Communications uses scanning, OCR and automated recognition and reformatting ...
Susan E. Hauser, Jonathan Schlaifer, Tehseen F. Sa...