Assessing Dialog System User Simulation Evaluation Measures Using Human Judges

Previous studies evaluate simulated dialog corpora using evaluation measures that can be automatically extracted from dialog system logs. However, the validity of these automatic measures has not been fully established. In this study, we first recruit human judges to assess the quality of three simulated dialog corpora and then use the human judgments as a gold standard to validate the conclusions drawn from the automatic measures. We observe that it is hard for the human judges to reach good agreement when asked to rate the quality of the dialogs from given perspectives. However, the human ratings give a consistent ranking of the quality of the simulated corpora generated by the different simulation models. When building prediction models of human judgments from the previously proposed automatic measures, we find that we cannot reliably predict human ratings with a regression model, but we can predict human rankings with a ranking model.
Hua Ai, Diane J. Litman
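
As a rough illustration of the abstract's final point (predicting absolute ratings with regression versus predicting relative order with a ranking model), here is a minimal sketch using synthetic data and generic scikit-learn models. The features, data, and model choices are hypothetical and are not the automatic measures or prediction models used in the paper.

```python
# Hypothetical sketch: contrast a regression model (predict absolute human
# ratings) with a pairwise ranking model (predict which of two dialogs humans
# rate higher), both built on automatically extracted dialog measures.
# All feature names and data below are synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Each row stands for automatic measures from one simulated dialog
# (e.g., dialog length, word-error rate, a task-completion proxy).
X = rng.normal(size=(90, 3))
# Hypothetical human quality ratings on a 1-5 scale, weakly tied to the features.
ratings = np.clip(3 + X @ np.array([0.4, -0.3, 0.2]) + rng.normal(scale=1.0, size=90), 1, 5)

# Regression: try to predict the absolute rating directly.
reg_r2 = cross_val_score(LinearRegression(), X, ratings, cv=5, scoring="r2")
print(f"regression R^2 (predicting ratings): {reg_r2.mean():.2f}")

# Ranking: recast the task as pairwise comparisons -- given two dialogs,
# predict which one the judges rated higher (a simple pairwise-difference setup).
pairs, labels = [], []
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        if ratings[i] != ratings[j]:
            pairs.append(X[i] - X[j])
            labels.append(1 if ratings[i] > ratings[j] else 0)
pairs, labels = np.asarray(pairs), np.asarray(labels)

rank_acc = cross_val_score(LogisticRegression(max_iter=1000), pairs, labels, cv=5, scoring="accuracy")
print(f"pairwise ranking accuracy: {rank_acc.mean():.2f}")
```

The pairwise recasting mirrors the intuition in the abstract: absolute scores are noisy across judges, while relative comparisons between dialogs or corpora tend to be more stable and easier to predict.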

Type: Conference
Year: 2008
Where: ACL
Authors: Hua Ai, Diane J. Litman