ECIR
2006
Springer

Automatic Acquisition of Chinese-English Parallel Corpus from the Web

Parallel corpora are a valuable resource for tasks such as cross-language information retrieval and data-driven natural language processing. Previously, only small-scale corpora were available, which restricted their practical use. This paper describes a system that overcomes this limitation by automatically collecting high-quality parallel bilingual corpora from the web. Previous systems used a single principal feature for parallel web page verification, whereas we use multiple features to identify parallel texts via a k-nearest-neighbor classifier. Our system was evaluated using a data set containing 6500 Chinese
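To illustrate the idea of verifying candidate page pairs with several features and a k-nearest-neighbor classifier, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not name the features, so the three used here (length ratio, URL similarity, structure similarity), the training values, and the function name `knn_predict` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    train: list of (feature_vector, label) pairs
    query: feature vector of the same length as the training vectors
    """
    # Order the training examples by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda ex: math.dist(ex[0], query))
    # Count the labels of the k closest examples and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors for candidate page pairs:
# (length_ratio, url_similarity, structure_similarity), each scaled to [0, 1].
# These values are made up for the example, not taken from the paper.
TRAIN = [
    ((0.95, 0.90, 0.85), "parallel"),
    ((0.90, 0.80, 0.90), "parallel"),
    ((0.85, 0.95, 0.80), "parallel"),
    ((0.30, 0.20, 0.40), "not_parallel"),
    ((0.50, 0.10, 0.30), "not_parallel"),
    ((0.20, 0.40, 0.20), "not_parallel"),
]

if __name__ == "__main__":
    candidate = (0.88, 0.85, 0.82)
    print(knn_predict(TRAIN, candidate))
```

Combining several features in one vector means that no single signal has to be decisive: a pair with a weak URL match can still be accepted if its other features place it close to known parallel pairs.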
Ying Zhang, Ke Wu, Jianfeng Gao, Phil Vines
Added 30 Oct 2010
Updated 30 Oct 2010
Type Conference
Year 2006
Where ECIR