In many text retrieval tasks, it is highly desirable to obtain a "similarity profile" of the document collection for a given query. We propose sampling-based techniques ...
We introduce a Bayesian model, BayesANIL, that is capable of estimating uncertainties associated with the labeling process. Given a labeled or partially labeled training corpus of...
Clustering by document concepts is a powerful way of retrieving information from a large number of documents. This task in general does not make any assumption on the data distrib...
The k-means algorithm with cosine similarity, also known as the spherical k-means algorithm, is a popular method for clustering document collections. However, spherical k-means ca...
Commercial OCR packages work best with highquality scanned images. They often produce poor results when the image is degraded, either because the original itself was poor quality,...