The past decade has witnessed remarkable progress in speech synthesis research, to the point where synthetic voices can be hard to distinguish from natural ones, at least for utterances with neutral, declarative prosody. Neutral intonation often does not suffice, however, in interactive systems: instead it can sound disengaged or “dead,” and can be misleading as to the intended meaning.
For concept-to-speech systems, especially interactive ones, natural language generation researchers have developed a variety of methods for making contextually appropriate prosodic choices, depending on discourse-related factors such as givenness, parallelism, or theme/rheme alternative sets, as well as information-theoretic considerations (Prevost, 1995; Hitzeman et al., 1998; Pan et al., 2002; Bulyko and Ostendorf, 2002; Theune, 2002; Kruijff-Korbayová et al., 2003; Nakatsu and White, 2006; Brenier et al., 2006; White et al., 2010). In this setting, it is possible to adapt limited-domain synthesis techniques to produce utterances with perceptually distinguishable, contextually varied intonation (see Black and Lenzo, 2000; Baker, 2003; van Santen et al., 2005; Clark et al., 2007, for example). Such utterances have typically been evaluated with listening tests, sometimes augmented with expert evaluations. For example, an evaluation of the limited-domain voice used in the FLIGHTS concept-to-speech system (Moore et al., 2004; White et al., 2010) demonstrated that the prosodic specifications produced by the system's natural language generation component yielded synthetic speech that was judged significantly more natural than that of two baseline voices, both in listening tests and in an expert evaluation.