Sciweavers

EACL
2006
ACL Anthology

Comparing Automatic and Human Evaluation of NLG Systems

Download www.csd.abdn.ac.uk

We consider the evaluation problem in Natural Language Generation (NLG) and present results for evaluating several NLG systems with similar functionality, including a knowledge-based generator and several statistical systems. We compare evaluation results for these systems by human domain experts, human non-experts, and several automatic evaluation metrics, including NIST, BLEU, and ROUGE. We find that NIST scores correlate best (> 0.8) with human judgments, but that all automatic metrics we examined are biased in favour of generators that select on the basis of frequency alone. We conclude that automatic evaluation of NLG systems has considerable potential, in particular where high-quality reference texts and only a small number of human evaluators are available. However, in general it is probably best for automatic evaluations to be supported by human-based evaluations, or at least by studies that demonstrate that a particular metric correlates well with human judgments in a give...
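The comparison the abstract describes, checking how closely an automatic metric tracks human judgments across systems, can be illustrated with a small correlation calculation. The following is a minimal sketch, not the authors' procedure: it assumes Pearson's r as the correlation measure (the abstract does not say which coefficient was used), the function name `pearson` is hypothetical, and the score lists are made-up placeholders rather than results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only, one entry per hypothetical system.
# These are NOT scores reported in the paper.
metric_scores = [0.2, 0.4, 0.5, 0.7, 0.9]   # e.g. an automatic metric
human_ratings = [2.0, 2.5, 3.5, 3.0, 4.5]   # e.g. mean human judgments

r = pearson(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would indicate that the metric ranks and spaces the systems much as the human judges do; a check of this kind is what the abstract means by demonstrating that a metric correlates well with human judgments.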

Anja Belz, Ehud Reiter

Automatic Evaluations | EACL 2006 | Human | Natural Language Processing | NLG Systems

Related Content

» Validating the Web-based Evaluation of NLG Systems

» A Flexible Approach to Natural Language Generation for Disabled Children

» Extracting Parallel Fragments from Comparable Corpora for Data-to-text Generation

» Intrinsic vs Extrinsic Evaluation Measures for Referring Expression Generation

» Finding Common Ground: Towards a Surface Realisation Shared Task

» Using a Randomised Controlled Clinical Trial to Evaluate an NLG System

» Comparing Humans and Automatic Speech Recognition Systems in Recognizing Dysarthric Speech

» Probabilistic Generation of Weather Forecast Texts

» Mining the Correlation between Human and Automatic Evaluation at Sentence Level

Post Info

Added	30 Oct 2010
Updated	30 Oct 2010
Type	Conference
Year	2006
Where	EACL
Authors	Anja Belz, Ehud Reiter