Tensor space model for document analysis

13 years 10 months ago


The Vector Space Model (VSM) has been at the core of information retrieval for the past decades. VSM treats documents as vectors in a high-dimensional space. In such a vector space, techniques like Latent Semantic Indexing (LSI), Support Vector Machines (SVM), Naive Bayes, etc., can then be applied for indexing and classification. However, in some cases the dimensionality of the document space can be extremely large, which makes these techniques infeasible due to the curse of dimensionality. In this paper, we propose a novel Tensor Space Model for document analysis. We represent documents as second-order tensors, or matrices. Correspondingly, a novel indexing algorithm called Tensor Latent Semantic Indexing (TensorLSI) is developed in the tensor space. Our theoretical analysis shows that TensorLSI is much more computationally efficient than conventional Latent Semantic Indexing, which makes it applicable to extremely large-scale data sets. Several experimental results on ...
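To make the matrix representation concrete, here is a minimal sketch. It assumes a document's term-weight vector of length n1 * n2 is simply reshaped row-major into an n1 x n2 matrix; the abstract does not say how terms are arranged, so that layout and the helper names `to_matrix`, `bilinear_project`, and `parameter_counts` are illustrative assumptions, not the TensorLSI algorithm itself. It shows only why a bilinear projection u^T X v needs n1 + n2 parameters where a linear projection of the flat vector needs n1 * n2.

```python
def to_matrix(vec, n1, n2):
    """Reshape a length n1*n2 term-weight vector into an n1 x n2 matrix (row-major).

    The row-major layout is an assumption made for illustration only.
    """
    if len(vec) != n1 * n2:
        raise ValueError("vector length must equal n1 * n2")
    return [vec[i * n2:(i + 1) * n2] for i in range(n1)]


def bilinear_project(X, u, v):
    """Project matrix X onto one dimension as u^T X v."""
    return sum(u[i] * X[i][j] * v[j]
               for i in range(len(u))
               for j in range(len(v)))


def parameter_counts(n1, n2):
    """Parameters per basis function: (flat vector space, second-order tensor space)."""
    return n1 * n2, n1 + n2


# Toy document with six term weights, viewed as a 2 x 3 matrix.
doc = [1.0, 0.0, 2.0, 0.0, 3.0, 0.0]
X = to_matrix(doc, 2, 3)

u = [1.0, 1.0]        # hypothetical left projection vector
v = [1.0, 0.0, 1.0]   # hypothetical right projection vector
score = bilinear_project(X, u, v)

vector_params, tensor_params = parameter_counts(1000, 1000)
```

For a vocabulary of one million terms arranged as a 1000 x 1000 matrix, `parameter_counts` gives 1,000,000 parameters for a linear projection against 2,000 for a bilinear one, which is the kind of reduction the tensor representation makes possible.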

Deng Cai, Xiaofei He, Jiawei Han
