Assessing user simulation for dialog systems using human judges and automatic evaluation measures

  • Hua Ai and Diane Litman

Abstract

While different user simulations are built to assist dialog system development, there is an increasing need to assess the quality of these simulations quickly and reliably. Previous studies have proposed several automatic evaluation measures for this purpose, but the validity of these measures has not been fully established. We present an assessment study in which human judgments of user simulation quality are collected as the gold standard for validating the automatic evaluation measures. We show that a ranking model can be built from the automatic measures to predict rankings of the simulations in the same order as the human judgments. We further show that the ranking model can be improved with a simple feature derived from time-series analysis.
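
To make the ranking-model idea concrete, here is a minimal sketch of how pairwise human preferences between simulations could train a ranker over automatic evaluation measures. The feature names, data values, and the choice of logistic regression on feature differences are illustrative assumptions, not the specific measures or learner used in the paper.

```python
# A minimal, hypothetical sketch: rank user simulations by learning from
# pairwise human preferences over vectors of automatic evaluation measures.
# All simulation names and feature values below are made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each simulation is described by automatic measures, e.g. (hypothetically)
# closeness of dialog length to real-user dialogs, user-action precision,
# and a time-series similarity feature of the kind the abstract alludes to.
simulations = {
    "sim_A": np.array([0.82, 0.75, 0.60]),
    "sim_B": np.array([0.64, 0.70, 0.40]),
    "sim_C": np.array([0.91, 0.80, 0.72]),
}

# Human judgments expressed as pairwise preferences: (preferred, dispreferred).
preferences = [("sim_C", "sim_A"), ("sim_A", "sim_B"), ("sim_C", "sim_B")]

# Reduce ranking to binary classification on feature-difference vectors:
# label 1 means the first simulation in the pair should outrank the second.
X, y = [], []
for better, worse in preferences:
    diff = simulations[better] - simulations[worse]
    X.extend([diff, -diff])
    y.extend([1, 0])

model = LogisticRegression().fit(np.array(X), np.array(y))

# Score each simulation with the learned weights and sort; the goal is a
# ranking that matches the order induced by the human judges.
scores = {name: float(model.decision_function(feats.reshape(1, -1))[0])
          for name, feats in simulations.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # e.g. ['sim_C', 'sim_A', 'sim_B']
```

The difference-vector trick turns ranking into ordinary binary classification; adding a new automatic measure, such as the time-series feature the abstract mentions, is then just one more column in each feature vector.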
