Resource selection for domain-specific cross-lingual IR

An under-explored question in cross-language information retrieval (CLIR) is the degree to which the performance of CLIR methods depends on the availability of high-quality translation resources for particular domains. To address this issue, we evaluate several competitive CLIR methods, trained on different corpora, on test documents in the medical domain. Our results show severe performance degradation when using a general-purpose training corpus or a commercial machine translation system (SYSTRAN) instead of a domain-specific training corpus. A related unexplored question is whether we can improve CLIR performance by systematically analyzing training resources and optimally matching them to target collections. We begin exploring this problem by suggesting a simple criterion for automatically matching training resources to target corpora. Using cosine similarity between training and target corpora as resource weights, we obtained an average improvement of 5.6% over using all resources...
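The matching criterion mentioned in the abstract, cosine similarity between each training corpus and the target corpus used as a resource weight, might be sketched roughly as follows. The abstract does not say how corpora are represented or how the weights enter the retrieval model, so everything here is an assumption made for illustration: the raw term-frequency vectors, the normalization of weights to sum to 1, the function names `term_vector`, `cosine`, and `resource_weights`, and the toy data.

```python
import math
import re
from collections import Counter


def term_vector(texts):
    """Bag-of-words term-frequency vector for a whole corpus (assumed representation)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def resource_weights(training_corpora, target_corpus):
    """Map each training resource name to a similarity-based weight.

    Weights are normalized to sum to 1; this normalization is an assumption,
    not something stated in the abstract.
    """
    if not training_corpora:
        return {}
    target_vec = term_vector(target_corpus)
    sims = {
        name: cosine(term_vector(docs), target_vec)
        for name, docs in training_corpora.items()
    }
    total = sum(sims.values())
    if total == 0:
        # No vocabulary overlap with the target: fall back to uniform weights.
        return {name: 1.0 / len(sims) for name in sims}
    return {name: s / total for name, s in sims.items()}


# Toy data, invented purely for illustration.
training = {
    "medical": ["patient treatment dose therapy", "clinical trial patient outcome"],
    "news": ["election government vote", "market stocks economy"],
}
target = ["patient therapy outcome", "dose clinical treatment"]
weights = resource_weights(training, target)
```

In this toy setup the medical resource shares vocabulary with the target collection while the news resource does not, so the medical resource receives the larger weight. How such weights would then be applied inside a particular CLIR method is left open here.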
Monica Rogati, Yiming Yang
Added 30 Jun 2010
Updated 30 Jun 2010
Type Conference
Year 2004
Authors Monica Rogati, Yiming Yang