
12 - Eye tracking for the online evaluation of prosody in speech synthesis

from Part V - Evaluation and shared tasks

Published online by Cambridge University Press: 05 July 2014

Michael White, Ohio State University
Rajakrishnan Rajkumar, Ohio State University
Kiwako Ito, Ohio State University
Shari R. Speer, Ohio State University
Amanda Stent, AT&T Research, Florham Park, New Jersey
Srinivas Bangalore, AT&T Research, Florham Park, New Jersey

Summary

Introduction

The past decade has witnessed remarkable progress in speech synthesis research, to the point where synthetic voices can be hard to distinguish from natural ones, at least for utterances with neutral, declarative prosody. Neutral intonation often does not suffice, however, in interactive systems: instead it can sound disengaged or “dead,” and can be misleading as to the intended meaning.

For concept-to-speech systems, especially interactive ones, natural language generation researchers have developed a variety of methods for making contextually appropriate prosodic choices, depending on discourse-related factors such as givenness, parallelism, or theme/rheme alternative sets, as well as information-theoretic considerations (Prevost, 1995; Hitzeman et al., 1998; Pan et al., 2002; Bulyko and Ostendorf, 2002; Theune, 2002; Kruijff-Korbayová et al., 2003; Nakatsu and White, 2006; Brenier et al., 2006; White et al., 2010). In this setting, it is possible to adapt limited-domain synthesis techniques to produce utterances with perceptually distinguishable, contextually varied intonation (see, for example, Black and Lenzo, 2000; Baker, 2003; van Santen et al., 2005; Clark et al., 2007). Such utterances have typically been evaluated with listening tests, sometimes augmented with expert evaluations. For example, an evaluation of the limited-domain voice used in the FLIGHTS concept-to-speech system (Moore et al., 2004; White et al., 2010) demonstrated that the prosodic specifications produced by the system's natural language generation component yielded synthetic speech judged significantly more natural than two baseline voices, both in listening tests and in an expert evaluation.
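To make the idea of contextually appropriate prosodic choices concrete, the following minimal Python sketch (ours, not the chapter's actual generation component; the function and variable names are hypothetical) assigns ToBI-style pitch accents from two simple discourse factors, givenness and contrast: given words are deaccented, contrastive words receive L+H* (cf. Ito and Speer, 2008), and discourse-new words default to H* (cf. Beckman et al., 2005).

```python
# A minimal, hypothetical sketch of discourse-driven accent assignment;
# it is not the system evaluated in this chapter. ToBI labels follow
# Beckman et al. (2005); L+H* for contrast follows Ito and Speer (2008).

def choose_accent(word, given, contrastive):
    """Pick a ToBI pitch accent for a content word, or None to deaccent it."""
    if word in contrastive:
        return "L+H*"  # contrastive accent
    if word in given:
        return None    # discourse-given material is typically deaccented
    return "H*"        # default accent for discourse-new information

# Example context: "the red ball" was just mentioned; now generate "the BLUE ball".
given = {"ball"}        # already mentioned in the discourse
contrastive = {"blue"}  # stands in contrast to "red"
for word in ["blue", "ball"]:
    print(word, choose_accent(word, given, contrastive) or "(deaccented)")
```

Real systems condition on richer context (alternative sets, information-theoretic measures) and emit full prosodic markup rather than isolated accent labels, but the contrast between rule output for "blue" (L+H*) and "ball" (deaccented) illustrates the kind of contextual variation the evaluations discussed here must detect.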

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2014


References

Baker, R. E. (2003). Using Unit Selection to Synthesise Contextually Appropriate Intonation in Limited Domain Synthesis. Master's thesis, Department of Linguistics, University of Edinburgh.
Beckman, M. E., Hirschberg, J., and Shattuck-Hufnagel, S. (2005). The original ToBI system and the evolution of the ToBI framework. In Jun, S.-A., editor, Prosodic Typology: The Phonology of Intonation and Phrasing. Oxford University Press, Oxford, UK.
Black, A. and Lenzo, K. (2000). Limited domain synthesis. In Proceedings of the International Conference on Spoken Language Processing (INTERSPEECH), pages 411-414, Beijing, China. International Speech Communication Association.
Brenier, J., Nenkova, A., Kothari, A., Whitton, L., Beaver, D., and Jurafsky, D. (2006). The (non)utility of linguistic features for predicting prominence in spontaneous speech. In Proceedings of the IEEE Spoken Language Technology Workshop, pages 54-57, Palm Beach, FL. Institute of Electrical and Electronics Engineers.
Bulyko, I. and Ostendorf, M. (2002). Efficient integrated response generation from multiple targets using weighted finite state transducers. Computer Speech and Language, 16(3-4):533-550.
Clark, R. A. J., Richmond, K., and King, S. (2007). Multisyn: Open-domain unit selection for the Festival speech synthesis system. Speech Communication, 49(4):317-330.
De Carolis, B., Pelachaud, C., Poggi, I., and Steedman, M. (2004). APML, a markup language for believable behavior generation. In Prendinger, H. and Ishizuka, M., editors, Life-like Characters. Tools, Affective Functions and Applications, pages 65-85. Springer, Berlin, Germany.
Espinosa, D., White, M., Fosler-Lussier, E., and Brew, C. (2010). Machine learning for text selection with expressive unit-selection voices. In Proceedings of the International Conference on Spoken Language Processing (INTERSPEECH), pages 1125-1128, Makuhari, Chiba, Japan. International Speech Communication Association.
Hitzeman, J., Black, A. W., Mellish, C., Oberlander, J., and Taylor, P. (1998). On the use of automatically generated discourse-level information in a concept-to-speech synthesis system. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Sydney, Australia. International Speech Communication Association.
Ito, K. and Speer, S. R. (2008). Use of L+H* for immediate contrast resolution. In Proceedings of Speech Prosody, Campinas, Brazil. Speech Prosody Special Interest Group.
Ito, K. and Speer, S. R. (2011). Semantically-independent but contextually-dependent interpretation of contrastive accent. In Frota, S., Elordieta, G., and Prieto, P., editors, Prosodic Categories: Production, Perception and Comprehension, pages 69-92. Springer, Dordrecht, The Netherlands.
Karaiskos, V., King, S., Clark, R. A. J., and Mayo, C. (2008). The Blizzard Challenge 2008. In Proceedings of the Blizzard Challenge (in conjunction with the ISCA Workshop on Speech Synthesis), Brisbane, Australia.
King, S. and Karaiskos, V. (2009). The Blizzard Challenge 2009. In Proceedings of the Blizzard Challenge, Edinburgh, Scotland. University of Edinburgh.
Kruijff-Korbayová, I., Ericsson, S., Rodríguez, K. J., and Karagjosova, E. (2003). Producing contextually appropriate intonation in an information-state based dialogue system. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 227-234, Budapest, Hungary. Association for Computational Linguistics.
Mayo, C., Clark, R. A. J., and King, S. (2011). Listeners' weighting of acoustic cues to synthetic speech naturalness: A multidimensional scaling analysis. Speech Communication, 53(3):311-326.
Moore, J. D., Foster, M. E., Lemon, O., and White, M. (2004). Generating tailored, comparative descriptions in spoken dialogue. In Proceedings of the Florida Artificial Intelligence Research Society Conference (FLAIRS), pages 917-922, Miami Beach, FL. The Florida Artificial Intelligence Research Society.
Nakatsu, C. and White, M. (2006). Learning to say it well: Reranking realizations by predicted synthesis quality. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 1113-1120, Sydney, Australia. Association for Computational Linguistics.
Pan, S., McKeown, K., and Hirschberg, J. (2002). Exploring features from natural language generation for prosody modeling. Computer Speech and Language, 16:457-490.
Prevost, S. (1995). A Semantics of Contrast and Information Structure for Specifying Intonation in Spoken Language Generation. PhD thesis, University of Pennsylvania.
Rajkumar, R., White, M., Ito, K., and Speer, S. R. (2010). Evaluating prosody in synthetic speech with online (eye-tracking) and offline (rating) methods. In Proceedings of the ISCA Workshop on Speech Synthesis (SSW), pages 276-281, Kyoto, Japan. International Speech Communication Association.
Speer, S. R. (2011). Eye movements as a measure of spoken language processing. In Cohn, A. C., Fougeron, C., and Huffman, M. K., editors, The Oxford Handbook of Laboratory Phonology, pages 580-592. Oxford University Press, Oxford, UK.
Swift, M. D., Campana, E., Allen, J. F., and Tanenhaus, M. K. (2002). Monitoring eye movements as an evaluation of synthesized speech. In Proceedings of the IEEE Workshop on Speech Synthesis, pages 19-22, Santa Monica, CA. Institute of Electrical and Electronics Engineers.
Theune, M. (2002). Contrast in concept-to-speech generation. Computer Speech and Language, 16(3-4):491-531.
van Hooijdonk, C., Commandeur, E., Cozijn, R., Krahmer, E., and Marsi, E. (2007). The online evaluation of speech synthesis using eye movements. In Proceedings of the ISCA Workshop on Speech Synthesis, pages 385-390, Bonn, Germany. International Speech Communication Association.
van Santen, J., Kain, A., Klabbers, E., and Mishra, T. (2005). Synthesis of prosody using multilevel unit sequences. Speech Communication, 46(3-4):365-375.
White, M., Clark, R. A. J., and Moore, J. (2010). Generating tailored, comparative descriptions with contextually appropriate intonation. Computational Linguistics, 36(2):159-201.
White, M., Rajkumar, R., Ito, K., and Speer, S. R. (2009). Eye tracking for the online evaluation of prosody in speech synthesis: Not so fast! In Proceedings of the International Conference on Spoken Language Processing (INTERSPEECH), pages 2523-2526, Brighton, UK. International Speech Communication Association.
Zen, H., Tokuda, K., and Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11):1039-1064.
