Mining Large-scale Comparable Corpora from Chinese-English News Collections

In this paper, we explore an approach based on cross-language information retrieval (CLIR) to constructing large-scale Chinese-English comparable corpora, which are valuable for translation knowledge mining. The initial source and target document sets are crawled from news websites and normalized to a uniform format. Keywords are first extracted from each source document; the extracted keywords are then translated and combined into query words according to certain criteria, and the resulting query is run against an index built from the target document set. The mapping between source and target documents is established according to the similarity score computed by the retrieval tool. Two methods for filtering the comparable document pairs are evaluated to ensure the quality of the comparable corpora. Experimental results indicate that our approach is effective for constructing Chinese-English comparable corpora.
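
The pipeline described in the abstract (keyword extraction, keyword translation, retrieval against a target index, similarity-based filtering) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the keyword extractor, the translation resource, the retrieval tool, or the two filtering methods. The sketch assumes frequency-based keywords, a toy bilingual lexicon, TF-IDF cosine similarity in place of the retrieval tool, and a single score threshold as the filter. All function names, the lexicon entries, and the threshold value are hypothetical, and source documents are assumed to be already word-segmented.

```python
import math
from collections import Counter


def top_keywords(tokens, k=3):
    """Pick the k most frequent tokens (assumed stand-in for keyword extraction)."""
    return [word for word, _ in Counter(tokens).most_common(k)]


def translate(keywords, lexicon):
    """Map source keywords to target-language query words via a toy lexicon."""
    query = []
    for word in keywords:
        query.extend(lexicon.get(word, []))  # untranslatable keywords are dropped
    return query


def build_index(target_docs):
    """Build TF-IDF vectors for the target documents (assumed retrieval model)."""
    tokenized = [doc.lower().split() for doc in target_docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mine_pairs(source_docs, target_docs, lexicon, threshold=0.2):
    """Return (source_idx, target_idx, score) for best matches above the threshold."""
    vectors, idf = build_index(target_docs)
    pairs = []
    for i, tokens in enumerate(source_docs):
        query = translate(top_keywords(tokens), lexicon)
        qvec = {term: tf * idf.get(term, 0.0) for term, tf in Counter(query).items()}
        scores = [(cosine(qvec, vec), j) for j, vec in enumerate(vectors)]
        if not scores:
            continue
        best_score, best_j = max(scores)
        if best_score >= threshold:  # assumed filter: a single similarity cutoff
            pairs.append((i, best_j, best_score))
    return pairs


# Toy data: pre-segmented Chinese source documents and raw English targets.
source_docs = [["地震", "救援", "地震", "城市"], ["足球", "比赛", "足球"]]
target_docs = [
    "earthquake rescue teams reach the city",
    "football match ends in a draw",
    "stock markets fall",
]
lexicon = {
    "地震": ["earthquake"],
    "救援": ["rescue"],
    "城市": ["city"],
    "足球": ["football"],
    "比赛": ["match"],
}
pairs = mine_pairs(source_docs, target_docs, lexicon)
```

On this toy data, each source document is paired with the target document that shares its translated keywords, and the unrelated third target is left unpaired. Raising the threshold trades recall for precision, which is the role the filtering step plays in the abstract.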

Degen Huang, Lian Zhao, Lishuang Li, Haitao Yu

Chinese-English Comparable Corpora | COLING 2010 | Comparable Corpora | Computational Linguistics | Target Documents |

Added	13 May 2011
Updated	13 May 2011
Type	Journal
Year	2010
Where	COLING
Authors	Degen Huang, Lian Zhao, Lishuang Li, Haitao Yu
