In most IR clustering problems, we cluster the documents directly, working in the document space and using cosine similarity between documents as the similarity measure. In many real-world applications, however, we have knowledge on the word side and wish to transform this knowledge to the document (concept) side. In this paper, we provide a mechanism for this knowledge transformation. To the best of our knowledge, this is the first model for this type of knowledge transformation. The model uses a nonnegative matrix factorization X = FSG^T, where X is the word-document semantic matrix, F is the posterior probability of a word belonging to a word cluster and represents knowledge in the word space, G is the posterior probability of a document belonging to a document cluster and represents knowledge in the document space, and S is a scaled matrix factor that provides a condensed view of X. We show how knowledge on words can improve document clustering, i.e., knowledge in the w...
Tao Li, Chris H. Q. Ding, Yi Zhang, Bo Shao
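To make the factorization X = FSG^T named in the abstract concrete, the following is a minimal sketch of a generic nonnegative tri-factorization fitted with standard multiplicative updates for the squared Frobenius error. It is not the authors' algorithm: it imposes no orthogonality or probabilistic normalization on F and G, and it does not inject prior word knowledge. The function names (`tri_factorize`, `mult_update`, and the other helpers), the toy matrix, and the cluster counts are all hypothetical and chosen only for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def mult_update(m, numer, denom, eps=1e-9):
    """Element-wise update m * numer / denom; eps guards against division by zero."""
    return [[v * n / (d + eps) for v, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(m, numer, denom)]


def frob_error(x, f, s, g):
    """Squared Frobenius norm of X - F S G^T."""
    approx = matmul(matmul(f, s), transpose(g))
    return sum((p - q) ** 2 for rp, rq in zip(x, approx) for p, q in zip(rp, rq))


def tri_factorize(x, k_words, k_docs, iters=200, seed=0):
    """Fit X ~ F S G^T with nonnegative factors (illustrative sketch only).

    x is a words-by-documents matrix; F is words x k_words,
    S is k_words x k_docs, and G is documents x k_docs.
    """
    rng = random.Random(seed)

    def rand(rows, cols):
        # Strictly positive start so multiplicative updates can move every entry.
        return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

    n_words, n_docs = len(x), len(x[0])
    f, s, g = rand(n_words, k_words), rand(k_words, k_docs), rand(n_docs, k_docs)
    xt = transpose(x)
    errors = [frob_error(x, f, s, g)]
    for _ in range(iters):
        b = matmul(s, transpose(g))  # X ~ F B, so update F with B fixed
        f = mult_update(f, matmul(x, transpose(b)),
                        matmul(f, matmul(b, transpose(b))))
        a = matmul(f, s)             # X^T ~ G A^T, so update G with A fixed
        g = mult_update(g, matmul(xt, a), matmul(g, matmul(transpose(a), a)))
        ft = transpose(f)            # update S with F and G fixed
        s = mult_update(s, matmul(matmul(ft, x), g),
                        matmul(matmul(matmul(ft, f), s), matmul(transpose(g), g)))
        errors.append(frob_error(x, f, s, g))
    return f, s, g, errors


if __name__ == "__main__":
    # Toy word-document matrix: words 0-1 occur in docs 0-1, words 2-3 in docs 2-3.
    toy = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 5, 4], [0, 0, 4, 5]]
    f, s, g, errors = tri_factorize(toy, k_words=2, k_docs=2)
    print("document clusters:", [row.index(max(row)) for row in g])
    print("error before and after:", errors[0], errors[-1])
```

In this sketch the largest entry in each row of G is read as a hard document-cluster assignment, and the rows of F play the same role for words. Reading F and G as posterior probabilities, as the abstract does, would additionally require normalizing them, which this sketch leaves out.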