
15 - Beyond Functional Speech Synthesis

from Section III - Measuring Speech

Published online by Cambridge University Press:  11 November 2021

Rachael-Anne Knight, City, University of London
Jane Setter, University of Reading

Summary

As synthetic voices enter the mass market, there is an increasing need for voice personalisation, that is, a voice for the text-to-speech system that not only conveys information but also exudes a persona, much like the human voice. We begin this chapter with a historical overview of the field, starting with model-based approaches, moving to concatenative systems, and finally to contemporary implementations of parametric synthesis. We then examine how the confluence of increased computational speed at reduced cost, the availability of large data sets, and advances in machine learning and artificial intelligence enables a whole new approach to speech synthesis, including the ability to create high-quality personalised voices. We then examine the role of crowdsourcing in developing a scalable method for voice customisation and adaptation. We discuss the benefits and challenges of acquiring recordings of novice voice talent of all ages from around the world and using these recordings for voice building. We conclude by discussing the impact of personalised voices and implications for future work.

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2021


