
19 - Speech Synthesis: State of the Art and Challenges for the Future

from Part III - Machine Synthesis of Social Signals

Published online by Cambridge University Press:  13 July 2017

Kallirroi Georgila, Institute for Creative Technologies
Judee K. Burgoon, University of Arizona
Nadia Magnenat-Thalmann, Université de Genève
Maja Pantic, Imperial College London
Alessandro Vinciarelli, University of Glasgow

Summary

Introduction

Speech synthesis (also called text-to-speech synthesis) is the automatic conversion of natural language text into speech. Speech synthesis has many potential applications: for example, it can serve as an aid to people with disabilities (see Challenges for the Future), generate the output of spoken dialogue systems (Lemon et al., 2006; Georgila et al., 2010), support speech-to-speech translation (Schultz et al., 2006), and provide voices for computer games.

Current state-of-the-art speech synthesizers can simulate neutral read-aloud speech (i.e., speech that sounds as if it were being read from a text) quite well, both in terms of naturalness and intelligibility (Karaiskos et al., 2008). Nevertheless, many commercial applications that require speech output still rely on prerecorded system prompts rather than synthetic speech. The reason is that, despite much progress in speech synthesis over the last twenty years, current state-of-the-art synthetic voices still lack the expressiveness of human voices. On the other hand, using prerecorded speech has several drawbacks. It is a very expensive process that often has to start from scratch for each new application. Moreover, if an application needs to be enhanced with new prompts, it is quite likely that the person (usually an actor) who recorded the initial prompts will no longer be available. Furthermore, human recordings cannot be used to generate content on the fly: all the utterances an application will use must be predetermined and recorded in advance, which is not always possible. For example, the number of names in the database of an automatic directory assistance service can be huge, and most such databases are continuously updated. In such cases, speech output is generated by mixing prerecorded speech (for the prompts) with synthetic speech (for the names) (Georgila et al., 2003). The results of such a mixture can be quite awkward.
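To make the mixed-output approach concrete, here is a minimal sketch of how a directory assistance response could be assembled by splicing a synthesized name between prerecorded carrier prompts. The file names, the synthesize_name placeholder, and the use of the soundfile library are assumptions for illustration only; they do not reproduce the system described in Georgila et al. (2003).

```python
# Minimal sketch (not from the chapter): a prerecorded carrier prompt is
# concatenated with a synthesized name to form the spoken response.
# File names, synthesize_name(), and the soundfile dependency are assumptions.
import numpy as np
import soundfile as sf  # any WAV I/O library would do

SAMPLE_RATE = 16000  # assume all prompts are 16 kHz mono


def synthesize_name(name: str) -> np.ndarray:
    """Placeholder for a real TTS call; here it just returns 0.5 s of silence."""
    return np.zeros(int(0.5 * SAMPLE_RATE), dtype=np.float32)


def build_response(name: str) -> np.ndarray:
    # Carrier prompts recorded once by a voice talent (hypothetical files).
    prompt_a, _ = sf.read("prompts/the_number_for.wav", dtype="float32")
    prompt_b, _ = sf.read("prompts/is_as_follows.wav", dtype="float32")
    # The open-ended slot (the name) is filled with synthetic speech.
    return np.concatenate([prompt_a, synthesize_name(name), prompt_b])


if __name__ == "__main__":
    sf.write("response.wav", build_response("some name"), SAMPLE_RATE)
```

The audible mismatch at the splice points between the recorded and the synthetic segments is exactly the awkwardness noted above.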

The discussion above shows that there is great motivation for further advances in the field of speech synthesis. Below we provide an overview of the current state of the art in speech synthesis, and present challenges for future work.

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2017


References

Adell, J., Bonafonte, A., & Escudero, D. (2006). Disfluent speech analysis and synthesis: A preliminary approach. In Proceedings of the International Conference on Speech Prosody.
Andersson, S., Georgila, K., Traum, D., Aylett, M., & Clark, R. A. J. (2010). Prediction and realisation of conversational characteristics by utilising spontaneous speech for unit selection. In Proceedings of the International Conference on Speech Prosody.
Andersson, S., Yamagishi, J., & Clark, R. A. J. (2012). Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis. Speech Communication, 54(2), 175–188.
Artstein, R., Traum, D., Alexander, O., et al. (2014). Time-offset interaction with a Holocaust survivor. In Proceedings of the International Conference on Intelligent User Interfaces (pp. 163–168).
Barra-Chicote, R., Yamagishi, J., King, S., Montero, J. M., & Macias-Guarasa, J. (2010). Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech. Speech Communication, 52(5), 394–404.
Black, A. W. & Lenzo, K. A. (2000). Limited domain synthesis. In Proceedings of the International Conference on Spoken Language Processing (vol. 2, pp. 411–414).
Black, A. W. & Taylor, P. (1997). Automatically clustering similar units for unit selection in speech synthesis. In Proceedings of the European Conference on Speech Communication and Technology (pp. 601–604).
Black, A. W. & Tokuda, K. (2005). The Blizzard challenge – 2005: Evaluating corpus-based speech synthesis on common datasets. In Proceedings of the European Conference on Speech Communication and Technology (pp. 77–80).
Bozkurt, B., Ozturk, O., & Dutoit, T. (2003). Text design for TTS speech corpus building using a modified greedy selection. In Proceedings of the European Conference on Speech Communication and Technology (pp. 277–280).
Campbell, N. (2006). Conversational speech synthesis and the need for some laughter. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1171–1178.
Campbell, N. (2007). Towards conversational speech synthesis: Lessons learned from the expressive speech processing project. In Proceedings of the ISCA Workshop on Speech Synthesis (pp. 22–27).
DeVault, D., Artstein, R., Benn, G., et al. (2014). SimSensei kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (pp. 1061–1068).
Douglas-Cowie, E., Cowie, R., & Schröder, M. (2000). A new emotion database: Considerations, sources and scope. In Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion (pp. 39–44).
Georgila, K. (2009). Using integer linear programming for detecting speech disfluencies. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Companion volume: Short papers (pp. 109–112).
Georgila, K., Black, A. W., Sagae, K., & Traum, D. (2012). Practical evaluation of human and synthesized speech for virtual human dialogue systems. In Proceedings of the International Conference on Language Resources and Evaluation (pp. 3519–3526).
Georgila, K., Sgarbas, K., Tsopanoglou, A., Fakotakis, N., & Kokkinakis, G. (2003). A speech-based human–computer interaction system for automating directory assistance services. International Journal of Speech Technology (special issue on Speech and Human Computer Interaction), 6(2), 145–159.
Georgila, K., Wolters, M., Moore, J. D., & Logie, R. H. (2010). The MATCH corpus: A corpus of older and younger users' interactions with spoken dialogue systems. Language Resources and Evaluation, 44(3), 221–261.
Hunt, A. J. & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 373–376).
Imai, S., Sumita, K., & Furuichi, C. (1983). Mel log spectrum approximation (MLSA) filter for speech synthesis. Electronics and Communications in Japan, 66(2), 10–18.
Iskarous, K., Goldstein, L. M., Whalen, D. H., Tiede, M. K., & Rubin, P. E. (2003). CASY: The Haskins configurable articulatory synthesizer. In Proceedings of the International Congress of Phonetic Sciences (pp. 185–188).
Karaiskos, V., King, S., Clark, R. A. J., & Mayo, C. (2008). The Blizzard challenge 2008. In Proceedings of the Blizzard Challenge Workshop.
Kawahara, H., Masuda-Katsuse, I., & de Cheveigné, A. (1999). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3–4), 187–207.
King, S. (2010). A tutorial on HMM speech synthesis. In Sadhana – Academy Proceedings in Engineering Sciences, Indian Academy of Sciences.
Kishore, S. P. & Black, A. W. (2003). Unit size in unit selection speech synthesis. In Proceedings of the European Conference on Speech Communication and Technology (pp. 1317–1320).
Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 67(3), 971–995.
Kominek, J. & Black, A. W. (2004). The CMU ARCTIC speech databases. In Proceedings of the ISCA Workshop on Speech Synthesis (pp. 223–224).
Lemon, O., Georgila, K., Henderson, J., & Stuttle, M. (2006). An ISU dialogue system exhibiting reinforcement learning of dialogue policies: Generic slot-filling in the TALK in-car system. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL) – Demonstrations (pp. 119–122).
Ling, Z.-H., Richmond, K., Yamagishi, J., & Wang, R.-H. (2008). Articulatory control of HMM-based parametric speech synthesis driven by phonetic knowledge. In Proceedings of the Annual Conference of the International Speech Communication Association (pp. 573–576).
Ling, Z.-H. & Wang, R.-H. (2006). HMM-based unit-selection using frame sized speech segments. In Proceedings of the International Conference on Spoken Language Processing (pp. 2034–2037).
Narayanan, S., Alwan, A., & Haker, K. (1997). Toward articulatory-acoustic models for liquid approximants based on MRI and EPG data: Part I, The laterals. Journal of the Acoustical Society of America, 101(2), 1064–1077.
Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys, 41(2), art. 10.
Pitrelli, J. F., Bakis, R., Eide, E. M., et al. (2006). The IBM expressive text-to-speech synthesis system for American English. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1099–1108.
Qin, L., Ling, Z.-H., Wu, Y.-J., Zhang, B.-F., & Wang, R.-H. (2006). HMM-based emotional speech synthesis using average emotion model. Lecture Notes in Computer Science, 4274, 233–240.
Sagisaka, Y., Kaiki, N., Iwahashi, N., & Mimura, K. (1992). ATR v-TALK speech synthesis system. In Proceedings of the International Conference on Spoken Language Processing (pp. 483–486).
Schultz, T., Black, A. W., Vogel, S., & Woszczyna, M. (2006). Flexible speech translation systems. IEEE Transactions on Audio, Speech, and Language Processing, 14(2), 403–411.
Socher, R., Bauer, J., Manning, C. D., & Ng, A. Y. (2013). Parsing with compositional vector grammars. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 455–465).
Stylianou, Y. (1999). Assessment and correction of voice quality variabilities in large speech databases for concatenative speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 377–380).
Sundaram, S. & Narayanan, S. (2002). Spoken language synthesis: Experiments in synthesis of spontaneous monologues. In Proceedings of the IEEE Speech Synthesis Workshop (pp. 203–206).
Székely, É., Cabral, J. P., Abou-Zleikha, M., Cahill, P., & Carson-Berndsen, J. (2012). Evaluating expressive speech synthesis from audiobooks in conversational phrases. In Proceedings of the International Conference on Language Resources and Evaluation (pp. 3335–3339).
Taylor, P. (2009). Text-to-Speech Synthesis. New York: Cambridge University Press.
Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 1315–1318).
Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL) (pp. 173–180).
Von Kempelen, W. (1791). Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine. Vienna: J. V. Degen.
Werner, S., & Hoffmann, R. (2007). Spontaneous speech synthesis by pronunciation variant selection: A comparison to natural speech. In Proceedings of the Annual Conference of the International Speech Communication Association (pp. 1781–1784).
Yamagishi, J., Nose, T., Zen, H., et al. (2009). Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 17(6), 1208–1230.
Yamagishi, J., Usabaev, B., King, S., et al. (2010). Thousands of voices for HMM-based speech synthesis – analysis and application of TTS systems built on various ASR corpora. IEEE Transactions on Audio, Speech, and Language Processing, 18(5), 984–1004.
Yoshimura, T., Masuko, T., Tokuda, K., Kobayashi, T., & Kitamura, T. (1997). Speaker interpolation in HMM-based speech synthesis system. In Proceedings of the European Conference on Speech Communication and Technology (pp. 2523–2526).
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., & Kitamura, T. (1998). Duration modeling for HMM-based speech synthesis. In Proceedings of the International Conference on Spoken Language Processing (pp. 29–32).
Young, S., Evermann, G., Gales, M., et al. (2009). The HTK Book (for HTK version 3.4). Cambridge: Cambridge University Press.
Zen, H., Nose, T., Yamagishi, J., et al. (2007). The HMM-based speech synthesis system (HTS) version 2.0. In Proceedings of the ISCA Workshop on Speech Synthesis (pp. 294–299).
Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039–1064.
Zen, H., Tokuda, K., & Kitamura, T. (2007). Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences. Computer Speech and Language, 21(1), 153–173.
Zhang, L. & Renals, S. (2008). Acoustic-articulatory modeling with the trajectory HMM. IEEE Signal Processing Letters, 15, 245–248.
