
Assessing user simulation for dialog systems using human judges and automatic evaluation measures

HUA AI and DIANE LITMAN


While different user simulations are built to assist dialog system development, there is an increasing need to assess the quality of these simulations quickly and reliably. Previous studies have proposed several automatic evaluation measures for this purpose, but the validity of these measures has not been fully established. We present an assessment study in which human judgments of user simulation quality are collected as the gold standard for validating automatic evaluation measures. We show that a ranking model built from the automatic measures can rank the simulations in the same order as the human judgments. We further show that the ranking model can be improved by adding a simple feature based on time-series analysis.
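As a rough illustration of what validating a ranking against human judgments can look like, the sketch below computes the fraction of simulation pairs that a predicted score list orders the same way as the human scores. This is not the ranking model from the study: the function name `pairwise_agreement`, the simulation labels, and all numbers are hypothetical and chosen only to make the example runnable.

```python
from itertools import combinations


def pairwise_agreement(human_scores, predicted_scores):
    """Fraction of simulation pairs ordered the same way by both score sets.

    Pairs that the human scores do not separate (ties) are skipped.
    """
    names = sorted(human_scores)
    concordant = 0
    comparable = 0
    for a, b in combinations(names, 2):
        human_diff = human_scores[a] - human_scores[b]
        pred_diff = predicted_scores[a] - predicted_scores[b]
        if human_diff == 0:
            continue  # human judges gave no preference for this pair
        comparable += 1
        if human_diff * pred_diff > 0:
            concordant += 1  # both sources prefer the same simulation
    return concordant / comparable if comparable else 0.0


# Hypothetical data: mean human ratings and scores from an automatic measure.
human = {"sim_a": 4.2, "sim_b": 3.1, "sim_c": 2.5, "sim_d": 1.8}
predicted = {"sim_a": 0.9, "sim_b": 0.4, "sim_c": 0.6, "sim_d": 0.1}

agreement = pairwise_agreement(human, predicted)
print(f"pairwise ordering agreement: {agreement:.3f}")
```

In this made-up data the automatic scores swap `sim_b` and `sim_c` relative to the human ratings, so five of the six pairs agree. A value of 1.0 would mean the predicted ranking reproduces the human ordering exactly.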




