Database selection is an important step when searching over large numbers of distributed text databases. The database selection task relies on statistical summaries of the databas...
This paper describes an application of IR and text categorization methods to a highly practical problem in biomedicine, specifically, Gene Ontology (GO) annotation. GO annotation...
Vast amounts of text on the Web are unstructured and ungrammatical, such as classified ads, auction listings, forum postings, etc. We call such text “posts.” Despite their in...
We propose a multiclass (MC) classification approach to text categorization (TC). To fully take advantage of both positive and negative training examples, a maximal figure-of-meri...
In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering"...
Inderjit S. Dhillon, Subramanyam Mallela, Rahul Ku...