A Wikipedia-Based Multilingual Retrieval Model

15 years 6 months ago


This paper introduces CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The retrieval model exploits the multilingual alignment of Wikipedia: given a document $d$ written in language $L$, we construct a concept vector $\mathbf{d}$ for $d$, where each dimension $i$ in $\mathbf{d}$ quantifies the similarity of $d$ with respect to a document $d_i$ chosen from the "$L$-subset" of Wikipedia. Likewise, for a second document $d'$ written in language $L'$, $L \neq L'$, we construct a concept vector $\mathbf{d}'$, using from the $L'$-subset of Wikipedia the topic-aligned counterparts $d'_i$ of our previously chosen documents. Since the two concept vectors $\mathbf{d}$ and $\mathbf{d}'$ are collection-relative representations of $d$ and $d'$, they are language-independent. I.e., their similarity can be computed directly, for instance with the cosine similarity measure. We present results of an extensive analysis that demonstrates the power of this new retrieval model: for a query document $d$ the topically most similar documents from...
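The construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three "Wikipedia" article pairs are invented toy sentences, the function names (`tf_vector`, `concept_vector`, `cl_esa_similarity`) are hypothetical, and plain term-frequency vectors with cosine similarity are assumed as the monolingual similarity in each dimension, a choice the abstract does not specify.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed monolingual representation)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def concept_vector(doc, index_docs):
    """Dimension i holds the similarity of doc to the i-th index article."""
    d = tf_vector(doc)
    return {i: cosine(d, tf_vector(a)) for i, a in enumerate(index_docs)}

def cl_esa_similarity(doc, index_docs, doc_prime, index_docs_prime):
    """Compare two documents in different languages via their concept vectors."""
    return cosine(concept_vector(doc, index_docs),
                  concept_vector(doc_prime, index_docs_prime))

# Toy stand-ins for topic-aligned Wikipedia articles (invented for illustration).
# EN[i] and DE[i] are assumed to describe the same concept.
EN = [
    "the sun is a star at the center of the solar system",
    "football is a team sport played with a ball",
    "a computer is a machine that executes programs",
]
DE = [
    "die sonne ist ein stern im zentrum des sonnensystems",
    "fussball ist ein mannschaftssport der mit einem ball gespielt wird",
    "ein computer ist eine maschine die programme verarbeitet",
]

query_en = "the team played football with a new ball"
same_topic_de = "die mannschaft hat fussball mit einem ball gespielt"
other_topic_de = "der stern im zentrum des sonnensystems"

sim_match = cl_esa_similarity(query_en, EN, same_topic_de, DE)
sim_other = cl_esa_similarity(query_en, EN, other_topic_de, DE)
print(round(sim_match, 3), round(sim_other, 3))
```

In this toy setup the German text about football scores higher against the English query than the German text about the sun, even though the two languages share no vocabulary: the comparison happens only in the shared concept space defined by the aligned articles. A realistic index would use many more article pairs and a weighted term model rather than raw counts.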

Martin Potthast, Benno Stein, Maik Anderka
