Sensitivity of Automated MT Evaluation Metrics on Higher Quality MT Output: BLEU vs Task-Based Evaluation Methods

We report the results of an experiment to assess the ability of automated MT evaluation metrics to remain sensitive to variations in MT quality as the average quality of the compared systems goes up. We compare two groups of metrics: those which measure the proximity of MT output to some reference translation, and those which evaluate the performance of some automated process on degraded MT output. The experiment shows that proximity-based metrics (such as BLEU) lose sensitivity as the scores go up, but performance-based metrics (e.g., Named Entity recognition from MT output) remain sensitive across the scale. We suggest a model for explaining this result, which attributes the stable sensitivity of performance-based metrics to measuring the cumulative functional effect of different language levels, while proximity-based metrics measure structural matches at a lexical level only and therefore miss higher-level errors that are more typical of better MT systems. Development of new auto...
Bogdan Babych, Anthony Hartley
Type: Conference
Year: 2008
Where: LREC
Authors: Bogdan Babych, Anthony Hartley
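For readers unfamiliar with the first metric family discussed in the abstract, the following is a minimal sketch of a proximity-based score: BLEU computed against a reference translation. It is an illustration only, not the authors' experimental setup; the example sentences and the choice of the sacrebleu library are assumptions. The paper's point is that such lexical-proximity scores tend to lose sensitivity as system quality rises, whereas task-based measures (e.g., Named Entity recognition over MT output) do not.

```python
# Illustrative only: a proximity-based MT metric (BLEU) scored against a reference.
# The sentences and the use of sacrebleu are assumptions, not material from the paper.
import sacrebleu

# Hypothetical outputs from two MT systems of differing quality, plus one reference.
reference = ["The minister arrived in Paris on Monday for talks."]
system_a = ["The minister arrived to Paris on Monday for talks."]   # higher quality
system_b = ["Minister arrive Paris in Monday talk for."]            # lower quality

# corpus_bleu takes a list of hypotheses and a list of reference streams.
bleu_a = sacrebleu.corpus_bleu(system_a, [reference])
bleu_b = sacrebleu.corpus_bleu(system_b, [reference])

print(f"System A BLEU: {bleu_a.score:.1f}")
print(f"System B BLEU: {bleu_b.score:.1f}")
```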