A Language-Independent Approach to Identify the Named Entities in Under-Resourced Languages and Clustering Multilingual Document

12 years 4 months ago


Abstract. This paper presents a language-independent Multilingual Document Clustering (MDC) approach on comparable corpora. Named entities (NEs) such as persons, locations, and organizations play a major role in measuring document similarity. We propose a method to identify the NEs present in under-resourced Indian languages (Hindi and Marathi) using the NEs present in English, which is a high-resource language. The identified NEs are then used to form multilingual document clusters with the Bisecting k-means clustering algorithm. We did not use any non-English linguistic tools or resources such as WordNet, a Part-Of-Speech tagger, or bilingual dictionaries, which makes the proposed approach completely language-independent. Experiments are conducted on a standard dataset provided by FIRE for its 2010 Ad-hoc Cross-Lingual document retrieval task on Indian languages. We have considered English, Hindi and Marathi news datasets for our experiments. The sys...
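The clustering step described in the abstract can be illustrated with a minimal sketch of bisecting k-means over named-entity vectors. This is not the authors' implementation. It assumes each document has already been reduced to a bag of NEs mapped to a shared English form, and it uses cosine similarity and a deterministic seeding rule, neither of which is stated in the abstract. The document ids, entity strings, and function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse NE count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(ids, docs):
    """Summed NE counts of a group; cosine ignores the missing 1/n scale."""
    total = Counter()
    for doc_id in ids:
        total.update(docs[doc_id])
    return total


def two_means(ids, docs, iterations=10):
    """Split one cluster in two (assumed seeding: first doc and the doc least similar to it)."""
    seed_a = docs[ids[0]]
    seed_b = docs[min(ids[1:], key=lambda i: cosine(seed_a, docs[i]))]
    centers = [seed_a, seed_b]
    left, right = [], []
    for _ in range(iterations):
        left, right = [], []
        for doc_id in ids:
            vec = docs[doc_id]
            if cosine(vec, centers[0]) >= cosine(vec, centers[1]):
                left.append(doc_id)
            else:
                right.append(doc_id)
        if not left or not right:
            break
        centers = [centroid(left, docs), centroid(right, docs)]
    return left, right


def bisecting_kmeans(docs, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(docs)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        largest = clusters[0]
        if len(largest) < 2:
            break
        left, right = two_means(largest, docs)
        if not left or not right:
            break  # the cluster could not be split further
        clusters = clusters[1:] + [left, right]
    return clusters


# Hypothetical toy input: NEs from English, Hindi and Marathi articles,
# assumed to be already mapped to a common English form.
docs = {
    "en1": Counter({"sachin tendulkar": 2, "mumbai": 1}),
    "hi1": Counter({"sachin tendulkar": 1, "mumbai": 1}),
    "en2": Counter({"manmohan singh": 2, "delhi": 1}),
    "mr1": Counter({"manmohan singh": 1, "delhi": 2}),
}
clusters = bisecting_kmeans(docs, k=2)
print(clusters)
```

On this toy input, documents sharing entities end up in the same cluster regardless of their source language, which is the behaviour the abstract relies on. Splitting the largest cluster first is one common selection rule; choosing the cluster with the lowest internal similarity is another.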

N. Kiran Kumar, G. S. K. Santosh, Vasudeva Varma
