IRFC
2011
Springer

Multilingual Document Clustering Using Wikipedia as External Knowledge

This paper presents Multilingual Document Clustering (MDC) on comparable corpora. Wikipedia, a structured multilingual knowledge base, has been heavily exploited in many monolingual clustering approaches and in comparing multilingual corpora, but no prior work has studied its impact on MDC. Here, we present an in-depth study of using Wikipedia to enhance MDC performance. We use its knowledge structure (cross-lingual links, categories, outlinks, Infobox information, etc.) to enrich the document representation for clustering multilingual documents. Because it avoids language-specific tools, the approach is a general framework that can easily be extended to other languages. We experimented with the bisecting k-means clustering algorithm on a standard dataset provided by FIRE for its 2010 Adhoc Cross-Lingual document retrieval task on Indian languages, using the English and Hindi datasets. The system is evaluated using F-scor...
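The abstract names bisecting k-means but this listing does not show the authors' implementation or feature weighting. As a rough illustration only, the sketch below shows one common form of the algorithm in plain Python: start with all documents in one cluster, then repeatedly split the cluster with the largest within-cluster squared error using 2-means. The function names, the Euclidean distance, the farthest-point seeding, and the toy vectors (imagined as keyword weights concatenated with Wikipedia-derived weights) are all assumptions made for this example, not details taken from the paper.

```python
import random

def _dist2(a, b):
    # Squared Euclidean distance between two dense vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def _sse(points):
    # Within-cluster sum of squared distances to the centroid.
    c = _mean(points)
    return sum(_dist2(p, c) for p in points)

def _two_means(points, rng, iters=20):
    # Plain 2-means; seeds are a random point and the point farthest from it.
    c0 = rng.choice(points)
    c1 = max(points, key=lambda p: _dist2(p, c0))
    left, right = [], []
    for _ in range(iters):
        new_left = [i for i, p in enumerate(points)
                    if _dist2(p, c0) <= _dist2(p, c1)]
        new_right = [i for i, p in enumerate(points)
                     if _dist2(p, c0) > _dist2(p, c1)]
        if not new_left or not new_right or new_left == left:
            break  # degenerate split or converged
        left, right = new_left, new_right
        c0 = _mean([points[i] for i in left])
        c1 = _mean([points[i] for i in right])
    return left, right

def bisecting_kmeans(points, k, seed=0):
    # Returns a list of clusters, each a list of indices into `points`.
    rng = random.Random(seed)
    clusters = [list(range(len(points)))]
    while len(clusters) < k:
        # Choose the cluster with the largest squared error to split next.
        target = max(clusters, key=lambda idx: _sse([points[i] for i in idx]))
        if len(target) < 2:
            break
        left, right = _two_means([points[i] for i in target], rng)
        if not left or not right:
            break  # cluster cannot be split further
        clusters.remove(target)
        clusters.append([target[i] for i in left])
        clusters.append([target[i] for i in right])
    return clusters

# Toy vectors (made up for illustration): first two dimensions stand in for
# one topic's features, last two for another topic's features.
docs = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
    [1.0, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 1.0],
]
print(sorted(sorted(c) for c in bisecting_kmeans(docs, 2)))
```

In a multilingual setting, documents in different languages would only land in the same cluster if their vectors share dimensions, which is the role the abstract assigns to Wikipedia-derived features; how those features are built is not specified in this listing.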
N. Kiran Kumar, G. S. K. Santosh, Vasudeva Varma
Added 30 Aug 2011
Updated 30 Aug 2011
Type Journal
Year 2011
Where IRFC