Sciweavers

EMNLP
2008

Improved Sentence Alignment on Parallel Web Pages Using a Stochastic Tree Alignment Model

Parallel web pages are an important source of training data for statistical machine translation. In this paper, we present a new approach to sentence alignment on parallel web pages. Parallel web pages tend to have parallel structures, and this structural correspondence can help identify parallel sentences. In our approach, each web page is represented as a tree, and a stochastic tree alignment model is used to exploit the structural correspondence for sentence alignment. Experiments show that this method significantly enhances alignment accuracy and robustness for parallel web pages, which are much more diverse and noisy than standard parallel corpora such as "Hansard". With improved sentence alignment performance, web mining systems are able to acquire parallel sentences of higher quality from the web.
Lei Shi, Ming Zhou
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2008
Where EMNLP
Authors Lei Shi, Ming Zhou
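
The abstract's idea that page structure can guide sentence alignment can be illustrated with a small sketch. This is not the authors' stochastic tree alignment model: it assumes well-formed HTML in which every opened tag is closed, and it replaces learned probabilities with a fixed score of 1 per matching node. The names `Node`, `TreeBuilder`, `parse`, and `align` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree; assumes every opened tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def align(a, b):
    """Return (score, text_pairs) for two subtrees; score is 0 if the tags differ.

    Children are aligned in order with a dynamic program that may skip a
    child on either side (an inserted or deleted subtree) at no cost.
    """
    if a.tag != b.tag:
        return 0, []
    if a.tag == "#text":
        return 1, [(a.text, b.text)]
    n, m = len(a.children), len(b.children)
    # best[i][j]: best (score, pairs) for the first i children of a and first j of b.
    best = [[(0, [])] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s, p = align(a.children[i - 1], b.children[j - 1])
            candidates = [best[i - 1][j], best[i][j - 1]]
            if s > 0:
                diag = best[i - 1][j - 1]
                candidates.append((diag[0] + s, diag[1] + p))
            best[i][j] = max(candidates, key=lambda c: c[0])
    score, pairs = best[n][m]
    return 1 + score, pairs


# Toy page pair: the English page has an extra <span> with no counterpart.
en_page = ("<html><body><div><h1>Welcome</h1><span>Advertisement</span>"
           "<p>Hello world.</p></div><div><p>Contact us.</p></div></body></html>")
fr_page = ("<html><body><div><h1>Bienvenue</h1><p>Bonjour le monde.</p></div>"
           "<div><p>Contactez-nous.</p></div></body></html>")

score, pairs = align(parse(en_page), parse(fr_page))
for src, tgt in pairs:
    print(src, "<->", tgt)
```

On this toy pair the unmatched `<span>` is skipped and the three remaining text nodes are paired by position in the tree. A real system would combine such structural evidence with lexical and length-based evidence, which this sketch leaves out entirely.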