Audiovisual Speech Processing
Book description

When we speak, we configure the vocal tract, which shapes both the visible motions of the face and the patterning of the audible speech acoustics. Similarly, we use these visible and audible behaviors to perceive speech. This book showcases a broad range of research investigating how these two types of signals are used in spoken communication, how they interact, and how they can be exploited to enhance the realistic synthesis and recognition of audible and visible speech. The volume begins by addressing two important questions about human audiovisual performance: how auditory and visual signals combine to access the mental lexicon, and where in the brain this and related processes take place. It then turns to the production and perception of multimodal speech and to how structures are coordinated within and across the two modalities. Finally, the book presents overviews and recent developments in machine-based recognition and synthesis of audiovisual (AV) speech.

Contents

References

Abry, C. and P. Badin (1996). Speech mapping as a framework for an integrated approach to the sensori-motor foundations of language. ETRW on Speech Production Modelling: from Control Strategies to Acoustics. Autrans, France.
Abry, C., P. Badin, K. Mawass and X. Pelorson (1998). “The Equilibrium Point Hypothesis and control space for relaxation movements or ‘When is movement actually needed to control movement?’” Les Cahiers de l’ICP, Bulletin de la Communication Parlée. Grenoble. 4: 27–33.
Abry, C., P. Badin and C. Scully (1994). Sound-to-gesture inversion in speech: the Speech Maps approach. Advanced speech applications. K. Varghese, S. Pfleger and J. P. Lefèvre. Berlin, Springer: 182–196.
Abry, C. and L.-J. Boë (1986). “Laws for lips.” Speech Communication 5: 97–104.
Abry, C., L.-J. Boë, P. Corsi, R. Descout, M. Gentil and P. Graillot (1980). Labialité et phonétique. Données fondamentales et études expérimentales sur la géométrie et la motricité labiales. Grenoble, Université des Langues et Lettres.
Abry, C., M.-A. Cathiard, J. Robert-Ribès and J.-L. Schwartz (1994). “The coherence of speech in audio-visual integration.” Cahiers de Psychologie Cognitive/Current Psychology of Cognition 13(1): 52–59.
Abry, C. and M.-T. Lallouache (1995a). “Pour un modèle d’anticipation dépendant du locuteur. Données sur l’arrondissement en français.” Bulletin de la Communication Parlée. 3: 85–99.
Abry, C., M.-T. Lallouache and M.-A. Cathiard (1996). How can coarticulation models account for speech sensitivity to audio-visual desynchronization? Speechreading by Humans and Machines. D. G. Stork and M. E. Hennecke. Berlin, Springer: 247–255.
Abry, C. and T. Lallouache (1995b). Modeling lip constriction anticipatory behaviour for rounding in French with the MEM (Movement Expansion Model). Proceedings of the International Congress of Phonetic Sciences. Stockholm, Sweden, pp. 152–155.
Abry, C., M. Stefanuto, A. Vilain and R. Laboissière (2002). What can the utterance “tan, tan” of Broca’s patient Leborgne tell us about the hypothesis of an emergent “babble-syllable” downloaded by SMA? Phonetics, Phonology and Cognition. J. Durand and B. Laks. Oxford University Press: 226–243.
Abry, C., A. Vilain and J.-L. Schwartz (2009). Vocalize to Localize. Amsterdam, Benjamins Current Topics 13.
Adjoudani, A. and C. Benoît (1996). On the integration of auditory and visual parameters in an HMM-based ASR. Speechreading by Humans and Machines. D. G. Stork and M. E. Hennecke. Berlin, Springer: 461–472.
Adjoudani, A., T. Guiard-Marigny, B. Le Goff, L. Revéret and C. Benoît (1997). A multimedia platform for audio-visual speech processing. European Conference on Speech Communication and Technology. Rhodes, Greece, pp. 1671–1674.
Agelfors, E., J. Beskow, B. Granström, M. Lundeberg, G. Salvi, K.-E. Spens and T. Öhman (1999). Synthetic visual speech driven from auditory speech. Auditory-visual Speech Processing Workshop. Santa Cruz, CA, pp. 123–127.
Aldridge, J. and G. Knupfer (1994). “Public safety: improving the effectiveness of CCTV security systems.” Journal of Forensic Science Society 34(4): 257–263.
Alégria, J., J. Leybaert, B. Charlier and C. Hage (1992). On the origin of phonological representations in the deaf : Hearing lips and hands. Analytic Approaches to Human Cognition. J. Alégria, D. Holender, J. Junça de Morais and M. Radeu. Amsterdam, Elsevier: 107–132.
Aleksic, P. S. and A. K. Katsaggelos (2003). An audio-visual person identification and verification system using FAPs as visual features. Workshop on Multimedia User Authentication. Santa Barbara, CA, pp. 80–84.
Aleksic, P. S. and A. K. Katsaggelos (2004a). Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition. International Conference on Acoustics, Speech and Signal Processing. Montreal, Canada, pp. 917–920.
Aleksic, P. S. and A. K. Katsaggelos (2004b). “Speech-to-video synthesis using MPEG-4 compliant visual features.” IEEE Transactions on Circuits and Systems for Video Technology 14(5): 682–692.
Aleksic, P. S. and A. K. Katsaggelos (2006). “Audio-visual biometrics.” Proceedings of the IEEE 94(11): 2025–2044.
Aleksic, P. S., G. Potamianos and A. K. Katsaggelos (2005). Exploiting visual information in automatic speech processing. Handbook of Image and Video Processing. A. Bovic. Burlington, MA, Elsevier: 1263–1289.
Aleksic, P. S., J. J. Williams, Z. Wu and A. K. Katsaggelos (2002). “Audio-visual speech recognition using MPEG-4 compliant visual features.” EURASIP Journal on Applied Signal Processing 1: 1213–1227.
Alissali, M., P. Deléglise and A. Rogozan (1996). Asynchronous integration of visual information in an automatic speech recognition system. International Conference on Spoken Language Processing. Philadelphia, PA, pp. 34–37.
Alley, T. R. (1999). Perception of static and dynamic computer displays of facial appearance in applied settings. International Conference on Perception and Action. Edinburgh.
Allison, T., A. Puce and G. McCarthy (2000). “Social Perception from visual cues: role of STS region.” Trends in Cognitive Science 4: 267–278.
Anastasakos, T., J. McDonough and J. Makhoul (1997). Speaker adaptive training: A maximum likelihood approach to speaker normalization. International Conference on Acoustics, Speech and Signal Processing. Munich, Germany, pp. 1043–1046.
Andersson, U. and B. Lidestam (2005). “Bottom-up driven speechreading in a speechreading expert: the case of AA.” Ear and Hearing 26: 214–224.
André-Obrecht, R., B. Jacob and N. Parlangeau (1997). Audiovisual speech recognition and segmental master slave HMM. European Tutorial Workshop on Audio-Visual Speech Processing. Rhodes, Greece, pp. 49–52.
Andruski, J. E., S. E. Blumstein and M. Burton (1994). “The effect of subphonetic differences on lexical access.” Cognition 52: 163–187.
Angkititrakul, P., J. H. L. Hansen, S. Choi, T. Creek, J. Hayes, J. Kim, D. Kwak, L. T. Noecker and A. Phan (2009). UT Drive: The smart vehicle project. In-Vehicle Corpus and Signal Processing for Driver Behavior. K. Takeda, J. H. L. Hansen, H. Erdoggan and H. Abut. New York, Springer: 55–67.
Anthony, D., E. Hines, J. Barham and D. Taylor (1990). A comparison of image compression by neural networks and principal component analysis. IEEE International Joint Conference on Neural Networks. San Diego, CA, pp. 339–344.
Arikan, O. and D. Forsyth (2002). Interactive motion generation from examples. SIGGRAPH. San Antonio, TX, pp. 483–490.
Arnal, L. H., B. Morillon, C. A. Kell and A. L. Giraud (2009). “Dual neural routing of visual facilitation in speech processing.” Journal of Neurosciences 29(43): 13445–13453.
Atal, B. S., J. J. Chang, M. V. Mathews and J. W. Tukey (1978). “Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer sorting technique.” Journal of the Acoustical Society of America 63: 1535–1555.
Attina, V., D. Beautemps and M.-A. Cathiard (2002a). Contrôle de l’anticipation vocalique d’arrondissement en Langage Parlé Complété. Journées d’Etudes sur la Parole. Nancy, France.
Attina, V., D. Beautemps and M.-A. Cathiard (2002b). Coordination of hand and orofacial movements for CV sequences in French cued speech. International Conference on Speech and Language Processing. Boulder, USA, pp. 1945–1948.
Attina, V., D. Beautemps and M.-A. Cathiard (2002c). Organisation spatio-temporelle main–lèvres–son de séquences CV en Langage Parlé Complété. Journées d’Etudes sur la Parole. Nancy, France.
Attina, V., D. Beautemps and M.-A. Cathiard (2003a). Temporal organization of French cued speech production. International Conference of Phonetic Sciences. Barcelona, Spain, pp. 1935–1938.
Attina, V., D. Beautemps and M.-A. Cathiard (2004a). Cued Speech production: giving a hand to speech acoustics. CFA/DAGA’04. Strasbourg, France, pp. 1143–1144.
Attina, V., D. Beautemps, M.-A. Cathiard and M. Odisio (2003b). Towards an audiovisual synthesizer for Cued Speech: rules for CV French syllables. Auditory-Visual Speech Processing. St-Jorioz, France, pp. 227–232.
Attina, V., D. Beautemps, M.-A. Cathiard and M. Odisio (2004b). “A pilot study of temporal organization in cued speech production of French syllables: rules for a cued speech synthesizer.” Speech Communication 44: 197–214.
Attina, V., M.-A. Cathiard and D. Beautemps (2002d). Controlling anticipatory behavior for rounding in French cued speech. Proceedings of ICSLP. Denver, Colorado, pp. 1949–1952.
Aubergé, V. and L. Lemaître (2000). The prosody of smile. ISCA Workshop on Speech and Emotion. Newcastle, Ireland, pp. 122–126.
Audouy, M. (2000). Logiciel de traitement d’images vidéo pour la détermination de mouvements des lèvres. Grenoble, ENSIMAG, Projet de fin d’études, option génie logiciel.
Auer, E. T. and L. E. Bernstein (1997). “Speechreading and the structure of the lexicon: Computationally modeling the effects of reduced phonetic distinctiveness on lexical uniqueness.” Journal of the Acoustical Society of America 102: 3704–3710.
Auer, E. T., L. E. Bernstein, R. S. Waldstein and P. E. Tucker (1997). Effects of phonetic variation and the structure of the lexicon on the uniqueness of words. ESCA Workshop on Audio-visual Speech Processing: Cognitive and Computational Approaches. Rhodes, Greece, pp. 21–24.
Auer, E. T., Jr., (2002). “The influence of the lexicon on speechread word recognition: contrasting segmental and lexical distinctiveness.” Psychonomic Bulletin & Review 9(2): 341–347.
Auer, E. T., Jr. and L. E. Bernstein (2007). “Enhanced visual speech perception in individuals with early onset hearing impairment.” Journal of Speech, Language, and Hearing Research 50(5): 1157–1165.
Auer, E. T., Jr., L. E. Bernstein and P.E. Tucker (2000). “Is subjective word familiarity a meter of ambient language? A natural experiment on effects of perceptual experience.” Memory and Cognition 28: 789–797.
Badin, P., G. Bailly, M. Raybaudi and C. Segebarth (1998). A three-dimensional linear articulatory model based on MRI data. International Conference on Speech and Language Processing. Sydney, Australia, pp. 417–420.
Badin, P., G. Bailly, L. Revéret, M. Baciu, C. Segebarth and C. Savariaux (2002). “Three-dimensional linear articulatory modeling of tongue, lips and face based on MRI and video images.” Journal of Phonetics 30(3): 533–553.
Badin, P., P. Borel, G. Bailly, L. Revéret, M. Baciu and C. Segebarth (2000). Towards an audiovisual virtual talking head: 3D articulatory modeling of tongue, lips and face based on MRI and video images. Proceedings of the 5th Speech Production Seminar. Kloster Seeon, Germany, pp. 261–264.
Badin, P. and A. Serrurier (2006). Three-dimensional linear modeling of tongue: Articulatory data and models. International Seminar on Speech Production (ISSP). Ubatuba, SP, Brazil, pp. 395–402.
Badin, P., Y. Tarabalka, F. Elisei and G. Bailly (2008). Can you “read tongue movements”? Interspeech. Brisbane, Australia, pp. 2635–2637.
Badin, P., Y. Tarabalka, F. Elisei and G. Bailly (2010). “Can you ‘read tongue movements’? Evaluation of the contribution of tongue display to speech understanding.” Speech Communication 52(4): 493–503.
Baer, T., P. Alfonso and K. Honda (1998). “Electromyography of the tongue muscles during vowels in /@pVp/ environment.” Annual Bulletin of the Research Institute for Logopedics and Phoniatrics 22: 7–19.
Bahl, L. R., P. F. Brown, P. V. DeSouza and L. R. Mercer (1986). Maximum mutual information estimation of hidden Markov model parameters for speech recognition. International Conference on Acoustics, Speech and Signal Processing. Tokyo, Japan, pp. 49–52.
Bailly, G. (1997). “Learning to speak. Sensori-motor control of speech movements.” Speech Communication 22(2–3): 251–267.
Bailly, G. (2002). Audiovisual speech synthesis. From ground truth to models. International Conference on Speech and Language Processing. Boulder, Colorado, pp. 1453–1456.
Bailly, G. and P. Badin (2002). Seeing tongue movements from outside. International Conference on Speech and Language Processing. Boulder, Colorado, pp. 1913–1916.
Bailly, G., P. Badin, D. Beautemps and F. Elisei (2010). Speech technologies for augmented communication. Computer Synthesized Speech Technologies: Tools for Aiding Impairment. J. Mullennix and S. Stern. Hershey, PA, IGI Global: 116–128.
Bailly, G., A. Bégault, F. Elisei and P. Badin (2008a). Speaking with smile or disgust: data and models. Auditory-Visual Speech Processing Workshop (AVSP). Tangalooma, Australia, pp. 111–116.
Bailly, G., M. Bérar, F. Elisei and M. Odisio (2003). “Audiovisual speech synthesis.” International Journal of Speech Technology 6: 331–346.
Bailly, G., O. Govokhina, G. Breton, F. Elisei and C. Savariaux (2008b). The trainable trajectory formation model TD-HMM parameterized for the LIPS 2008 challenge. Interspeech. Brisbane, Australia, pp. 2318–2321.
Bailly, G., O. Govokhina, F. Elisei and G. Breton (2009). “Lip-synching using speaker-specific articulation, shape and appearance models.” Journal of Acoustics, Speech and Music Processing. Special issue on “Animating Virtual Speakers or Singers from Audio: Lip-Synching Facial Animation” ID 769494: 11 pages.
Bailly, G., R. Laboissière and A. Galván (1997). Learning to speak: Speech production and sensori-motor representations. Self-Organization, Computational Maps and Motor Control. P. Morasso and V. Sanguineti. Amsterdam, Elsevier: 593–615.
Bailly, G., R. Laboissière and J.-L. Schwartz (1991). “Formant trajectories as audible gestures: An alternative for speech synthesis.” Journal of Phonetics 19(1): 9–23.
Banesty, J., J. Chen, Y. Huang and I. Cohen (2009). Noise Reduction in Speech Processing. Berlin, Springer.
Barker, J. and X. Shao (2009). “Energetic and informational masking effects in an audiovisual speech recognition system.” IEEE Transactions on Audio, Speech, and Language Processing 17(3): 446–458.
Barker, J. P. and F. Berthommier (1999). Estimation of speech acoustics from visual speech features: A comparison of linear and non-linear models. International Conference on Auditory-Visual Speech Processing. Santa Cruz, CA, pp. 112–117.
Barker, L. J. (2003). “Computer-assisted vocabulary acquisition: The CSLU vocabulary tutor in oral-deaf education.” Journal of Deaf Studies and Deaf Education 8: 187–198.
Barron, J. L., D. J. Fleet and S. Beauchemin (1994). “Performance of optical flow techniques.” International Journal of Computer Vision 12(1): 43–77.
Bartlett, J. C., S. Hurry and W. Thorley (1984). “Typicality and familiarity of faces.” Memory & Cognition 12: 219–228.
Bassili, J. N. (1978). “Facial motion in the perception of faces and of emotional expressions.” Journal of Experimental Psychology: Human Perception and Performance 4: 373–379.
Basu, S., I. Essa and A. Pentland (1996). Motion regularization for model-based head tracking. International Conference on Pattern Recognition (ICPR). Vienna, Austria, pp. 611–616.
Basu, S., N. Oliver and A. Pentland (1998). 3D modeling and tracking of human lip motions. International Conference on Computer Vision. Mumbai, India, pp. 337–343.
Baudouin, J. Y., S. Sansone and G. Tiberghien (2000). “Recognizing expression from familiar and unfamiliar faces.” Pragmatics & Cognition 8(1): 123–147.
Bavelier, D. and H. J. Neville (2002). “Cross-modal plasticity: where and how?Nature Reviews Neuroscience 3(6): 443–452.
Beale, R. and T. Jackson (1990). Neural Computing: An Introduction. Bristol, Institute of Physics Publishing.
Beautemps, D., P. Badin and G. Bailly (2001). “Degrees of freedom in speech production: analysis of cineradio- and labio-films data for a reference subject, and articulatory-acoustic modeling.” Journal of the Acoustical Society of America 109(5): 2165–2180.
Beautemps, D., P. Borel and S. Manolios (1999). Hyper-articulated speech: auditory and visual intelligibility. EuroSpeech. Budapest, Hungary, pp. 109–112.
Beier, T. and S. Neely (1992a). “Feature-based image metamorphosis.” Computer Graphics 26(2): 35–42.
Beier, T. and S. Neely (1992b). Feature-based image metamorphosis. Computer Graphics (Proc. ACM SIGGRAPH), ACM, pp. 35–42.
Bell, A. M. (1867). Visible Speech: The Science of Universal Alphabetics. London, Simkin, Marshall & Co.
Ben Youssef, A., P. Badin and G. Bailly (2010). Can tongue be recovered from face? The answer of data-driven statistical models. Interspeech. Tokyo, pp. 2002–2005.
Ben Youssef, A., P. Badin, G. Bailly and P. Heracleous (2009). Acoustic-to-articulatory inversion using speech recognition and trajectory formation based on phoneme hidden Markov models. Interspeech. Brighton, pp. 2255–2258.
Benguerel, A.-P. and M. K. Pichora-Fuller (1982). “Coarticulation effects in lipreading.” Journal of Speech and Hearing Research 25: 600–607.
Benoît, C. (2000). Bimodal speech: Cognitive and computational approaches. International Conference on Human and Machine Processing of Language and Speech. Bangkok, Chulalongkorn University Press, pp. 314–328.
Benoît, C., C. Abry, M.-A. Cathiard, T. Guiard-Marigny and M.-T. Lallouache (1995a). Read my lips: Where? How? When? And so . . . What? 8th International Congress on Event Perception & Action. Marseille, France, pp. 423–426.
Benoît, C., T. Guiard-Marigny, B. Le Goff and A. Adjoudani (1995b). Which components of the face do humans and machines best speechread? Speechreading by Humans and Machines. D. G. Stork and M. E. Hennecke. Berlin, Springer: 315–328.
Benoît, C., T. Lallouache, T. Mohamadi and C. Abry (1992). A set of French visemes for visual speech synthesis. Talking Machines: Theories, Models and Designs. G. Bailly and C. Benoît. Amsterdam, Elsevier: 485–501.
Benoît, C. and B. Le Goff (1998). “Audio-visual speech synthesis from French text: Eight years of models, designs and evaluation at the ICP.” Speech Communication 26: 117–129.
Benoît, C. and L. C. W. Pols (1992). On the assessment of synthetic speech. Talking Machines: Theories, Models and Designs. G. Bailly, C. Benoît and T. R. Sawallis. Amsterdam, Elsevier: 435–441.
Benoit, M. M., T. Raij, F. H. Lin, I. P. Jääskeläinen and S. Stufflebeam (2010). “Primary and multisensory cortical activity is correlated with audiovisual percepts.” Human Brain Mapping 31(4): 526–538.
Bérar, M., G. Bailly, M. Chabanas, M. Desvignes, F. Elisei, M. Odisio and Y. Pahan (2006). Towards a generic talking head. Towards a better understanding of speech production processes. J. Harrington and M. Tabain. New York, Psychology Press: 341–362.
Bérar, M., G. Bailly, M. Chabanas, F. Elisei, M. Odisio and Y. Pahan (2003). Towards a generic talking head. 6th International Seminar on Speech Production. Sydney, Australia, pp. 7–12.
Bergen, J. R., P. Anandan, K. J. Hanna and R. Hingorani (1992). Hierarchical model-based motion estimation. European Conference on Computer Vision. Santa Margherita, Italy, pp. 237–252.
Berger, K. W. (1972). “Visemes and homophenous words.” Teacher of the Deaf 70: 396–399.
Bernstein, L. E., E. T. Auer, J. K. Moore, C. W. Ponton, M. Don and M. Singh (2000a). Does Auditory Cortex Listen to Visible Speech? San Francisco, CA, Cognitive Neuroscience Society.
Bernstein, L. E., E. T. Auer, J. K. Moore, C. W. Ponton, M. Don and M. Singh (2002). “Visual speech perception without primary auditory cortex activation.” Neuroreport 13(3): 311–315.
Bernstein, L. E., E. T. Auer Jr. and J. K. Moore (2004). Audiovisual speech binding: Convergence or association? Handbook of Multisensory Processing. G. A. Calvert, C. Spence and B. E. Stein. Cambridge, MA, MIT Press: 203–223.
Bernstein, L. E., E. T. Auer Jr., M. Wagner and C. W. Ponton (2008). “Spatiotemporal dynamics of audiovisual speech processing.” Neuroimage 39(1): 423–435.
Bernstein, L. E., D. C. Coulter, M. P. O’Connell, S. P. Eberhardt and M. E. Demorest (1993). Vibrotactile and haptic speech codes. International Conference on Tactile Aids, Hearing Aids, and Cochlear Implants. Stockholm, Sweden.
Bernstein, L. E., M. E. Demorest and P. E. Tucker (1998). What makes a good speechreader? First you have to find one. Hearing by Eye II. R. Campbell, B. Dodd and D. Burnham. Hove, UK, Psychology Press: 211–227.
Bernstein, L. E., M. E. Demorest and P. E. Tucker (2000b). “Speech perception without hearing.” Perception & Psychophysics 62: 233–252.
Bernstein, L. E. and S. P. Eberhardt (1986a). Johns Hopkins Lipreading Corpus I-II: Disc 1. Baltimore, MD, Johns Hopkins University.
Bernstein, L. E. and S. P. Eberhardt (1986b). Johns Hopkins Lipreading Corpus Videodisk Set. Baltimore, MD, Johns Hopkins University.
Bernstein, L. E., P. Iverson and E. T. Auer Jr. (1997). Elucidating the complex relationships between phonetic perception and word recognition in audiovisual speech perception. Workshop on Auditory-Visual Speech Processing. Rhodes, Greece, pp. 89–92.
Bernstein, L. E., J. Jiang, A. Alwan and E. T. Auer Jr. (2001). Similarity structure in visual phonetic perception and optical phonetics. Workshop on Auditory-Visual Speech Processing. Aalborg, Denmark, pp. 50–55.
Bertelson, P., J. Vroomen and B. de Gelder (1997). Auditory-visual interaction in voice localization and in bimodal speech recognition: The effects of desynchronization. ESCA Workshop on Audio-Visual Speech Processing: Cognitive and Computational Approaches. Rhodes, Greece, pp. 97–100.
Berthommier, F. (2001). Audio-visual recognition of spectrally reduced speech. International Conference on Auditory-Visual Speech Processing. Aalborg, Denmark, pp. 183–189.
Berthommier, F. (2003). A phonetically neutral model of the low-level audiovisual interaction. Auditory-Visual Speech Processing. St-Jorioz, France, pp. 89–94.
Berthommier, F. and H. Glotin (1999). A new SNR-feature mapping for robust multistream speech recognition. International Congress on Phonetic Sciences. San Francisco, CA, pp. 711–715.
Besle, J., C. Fischer, A. Bidet-Caulet, F. Lecaignard, O. Bertrand and M. H. Giard (2008). “Visual activation and audiovisual interactions in the auditory cortex during speech perception: intracranial recordings in humans.” Journal of Neurosciences 28(52): 14301–14310.
Besle, J., A. Fort, C. Delpuech and M. H. Giard (2004). “Bimodal speech: early suppressive visual effects in human auditory cortex.” European Journal of Neuroscience 20: 2225–2234.
Beyerlein, P. (1998). Discriminative model combination. International Conference on Acoustics, Speech and Signal Processing. Seattle, WA, pp. 481–484.
Beymer, D. and T. Poggio (1996). “Image representations for visual learning.” Science 272: 1905–1909.
Beymer, D., A. Shashua and T. Poggio (1993). Example based image analysis and synthesis. Boston, MA, MIT AI Lab., Tech. Rep. no. 1431.
Bilmes, J. and C. Bartels (2005). “Graphical model architectures for speech recognition.” IEEE Signal Processing Magazine 22(5): 89–100.
Binder, J. R., J. A. Frost, T. A. Hammeke, P. S. Bellgowan, J. A. Springer, J. N. Kaufman and E. T. Possing (2000). “Human temporal lobe activation by speech and nonspeech sounds.” Cerebral Cortex 10: 512–528.
Binder, J. R., S. M. Rao, T. A. Hammeke, F. Z. Yetkin, A. Jesmanowicz, P. A. Bandettini, E. C. Wong, L. D. Estkowski, M. D. Goldstein and V. M. Haughton (2004). “Functional magnetic resonance imaging of human auditory cortex.” Annals of Neurology 35(6): 662–672.
Bingham, G. P. (1987). “Kinematic form and scaling – Further investigations on the visual perception of lifted weight.” Journal of Experimental Psychology: Human Perception and Performance 13(2): 155–177.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford, Clarendon Press.
Black, A. and P. Taylor (1997). The Festival speech synthesis system. University of Edinburgh.
Black, M., D. Fleet and Y. Yaccob (2000). “Robustly estimating changes in image appearance.” Computer Vision and Image Understanding, Special Issue on Robust Statistical Techniques in Image Understanding 78(1): 8–31.
Blake, A., R. Curwen and A. Zisserman (1993). “A framework for spatiotemporal control in the tracking of visual contours.” International Journal of Computer Vision 11: 127–145.
Blake, A. and M. Isard (1994). “3D position, attitude and shape input using video tracking of hands and lips.” Computer Graphics Proceedings (Annual Conference Series (ACM)): 185–192.
Blanz, V. and T. Vetter (1999). A morphable model for the synthesis of 3D faces. SIGGRAPH. Los Angeles, CA, pp. 187–194.
Borel, P., P. Badin, L. Revéret and G. Bailly (2000). Modélisation articulatoire linéaire 3D d’un visage pour une tête parlante virtuelle. Actes des 23èmes Journées d’Etude de la Parole. Aussois, France, pp. 121–124.
Bosseler, A. and D. W. Massaro (2004). “Development and evaluation of a computer-animated tutor for vocabulary and language learning in children with autism.” Journal of Autism and Developmental Disorders 33: 653–672.
Boston, D. W. (1973). “Synthetic facial communication.” British Journal of Audiology 7: 95–101.
Bothe, H.-H. (1996). Relations of audio and visual speech signals in a physical feature space: implications for the hearing-impaired. Speechreading by Humans and Machines. D. G. Stork and M. E. Hennecke. Berlin, Springer: 445–460.
Bothe, H.-H., F. Rieger and R. Tackmann (1993). Visual speech and co-articulation effects. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. E634–E637.
Bould, E., N. Morris and B. Wink (2008). “Recognising subtle emotional expressions: the role of facial movements.” Cognition & Emotion 22: 1569–1587.
Bourlard, H. and S. Dupont (1996). A new ASR approach based on independent processing and recombination of partial frequency bands. International Conference on Spoken Language Processing. Philadelphia, PA, pp. 426–429.
Bradley, D. C., G. C. Chang and R. A. Andersen (1998). “Encoding of three-dimensional structure-from-motion by primate area MT neurons.” Nature 392: 714–717.
Brancazio, L. and J. L. Miller (2005). “Use of visual information in speech perception: Evidence for a visual rate effect both with and without a McGurk effect.” Perception & Psychophysics 67: 759–769.
Brand, M. (1999). Voice puppetry. SIGGRAPH’99. Los Angeles, CA, pp. 21–28.
Brand, M. and A. Hertzmann (2000). Style machines. SIGGRAPH. New Orleans, LO, pp. 183–192.
Brand, M., N. Oliver and A. Pentland (1997). Coupled hidden Markov models for complex action recognition. Conference on Computer Vision and Pattern Recognition. San Juan, Puerto Rico, pp. 994–999.
Bredin, H. and G. Chollet (2007). “Audiovisual speech synchrony measure: Application to biometrics.” EURASIP Journal on Advances in Signal Processing 1: 179–190.
Breeuwer, M. and R. Plomp (1986). “Speechreading supplemented with auditorily presented speech parameters.” Journal of the Acoustical Society of America 79(2): 481–499.
Bregler, C., M. Covell and M. Slaney (1997a). Video rewrite: visual speech synthesis from video. International Conference on Auditory-Visual Speech Processing. Rhodes, Greece, pp. 153–156.
Bregler, C., M. Covell and M. Slaney (1997b). VideoRewrite: driving visual speech with audio. SIGGRAPH’97. Los Angeles, CA, pp. 353–360.
Bregler, C., H. Hild, S. Manke and A. Waibel (1993). Improving connected letter recognition by lipreading. International Conference on Acoustics, Speech and Signal Processing. Minneapolis, MN, pp. 557–560.
Bregler, C. and Y. Konig (1994). “Eigenlips” for robust speech recognition. International Conference on Acoustics, Speech and Signal Processing. Adelaide, Australia, pp. 669–672.
Bregler, C., G. Williams, S. Rosenthal and I. McDowall (2009). Improving acoustic speaker verification with visual body-language features. ICASSP. Taipei, Taiwan, pp. 1909–1912.
Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA, MIT Press.
Bristow, D., G. Dehaene-Lambertz, J. Mattout, C. Soares, T. Gliga, S. Baillet and J.-F. Mangin (2008). “Hearing faces: How the infant brain matches the face it sees with the speech it hears.” Journal of Cognitive Neuroscience 21: 905–921.
Brooke, N. M. (1982). “Visual speech synthesis for speech perception experiments.” Journal of the Acoustical Society of America 71(S77).
Brooke, N. M. (1989). Visible speech signals: investigating their analysis, synthesis and perception. The Structure of Multimodal Dialogue. M. M. Taylor, F. Néel and D. G. Bouwhuis. Amsterdam, North-Holland: 249–258.
Brooke, N. M. (1992a). Computer graphics animations of speech production. Advances in Speech, Hearing and Language Processing. W. A. Ainsworth. London, JAI Press: 87–134.
Brooke, N. M. (1992b). Computer graphics synthesis of talking faces. Talking Machines: Theories, Models and Designs. G. Bailly and C. Benoît. Amsterdam, Elsevier: 505–522.
Brooke, N. M. (1996). Talking heads and speech recognisers that can see: the computer processing of visual speech signals. Speechreading by Humans and Machines. D. G. Stork and M. E. Hennecke. Berlin, Springer: 351–371.
Brooke, N. M. and S. D. Scott (1994a). Computer graphics animations of talking faces based on stochastic models. International Symposium on Speech, Image-processing and Neural Networks. Hong Kong, pp. 73–76.
Brooke, N. M. and S. D. Scott (1994b). PCA image coding schemes and visual speech intelligibility. Proceedings of the Institute of Acoustics, Autumn Meeting. Windermere, England, pp. 123–129.
Brooke, N. M. and S. D. Scott (1998a). An audio-visual speech synthesiser. ETRW on Speech Technology in Language Learning. Marhölmen, Sweden, pp. 147–150.
Brooke, N. M. and S. D. Scott (1998b). Two- and three-dimensional audio-visual speech synthesis. Auditory-Visual Speech Processing Workshop (AVSP). Terrigal, Australia, pp. 213–218.
Brooke, N. M. and A. Q. Summerfield (1983). “Analysis, synthesis and perception of visible articulatory movements.” Journal of Phonetics 11: 63–76.
Brooke, N. M. and P. D. Templeton (1990). Visual speech intelligibility of digitally processed facial images. Proceedings of the Institute of Acoustics, Autumn Meeting. Windermere, England, pp. 483–490.
Brooke, N. M. and M. J. Tomlinson (2000). Processing facial images to enhance speech communications. The Structure of Multimodal Dialogue II. M. M. Taylor, F. Néel and D. G. Bouwhuis. Philadelphia, John Benjamins Publishing Company: 465–484.
Brooke, N. M., M. J. Tomlinson and R. K. Moore (1994). Automatic speech recognition that includes visual speech cues. Proceedings of the Institute of Acoustics, Autumn Meeting. Windermere, England, pp. 15–22.
Brothers, L. (1990). “The social brain: a project for integrating primate behavior and neurophysiology in a new domain.” Concepts in Neuroscience 1: 27–51.
Bruce, V. (1994). “Stability from variation: The case of face recognition. The M.D. Vernon memorial lecture.” Quarterly Journal of Experimental Psychology 47A: 5–28.
Bruce, V., D. Carson, M. A. Burton and A. W. Ellis (2000). “Perceptual priming is not a necessary consequence of semantic classification of pictures.” Quarterly Journal of Experimental Psychology (53A): 239–323.
Bruce, V. and S. Langton (1994). “The use of pigmentation and shading information in recognising the sex and identities of faces.” Perception 23: 803–822.
Bruce, V. and T. Valentine (1985). “Identity priming in face recognition.” British Journal of Psychology 76: 373–383.
Bruce, V. and T. Valentine (1988). When a nod’s as good as a wink. The role of dynamic information in facial recognition. Practical Aspects of Memory: Current Research and Issues. M. M. Gruneberg and E. Morris. Hillsdale, NJ, Erlbaum.
Bruce, V. and A. W. Young (1986). “Understanding face recognition.” British Journal of Psychology 77: 305–327.
Bruner, J. S. and R. Tagiuri (1954). The perception of people. Handbook of Social Psychology, Vol 2. Reading. G. Lindzey. Boston, MA, Addison-Wesley: 634–654.
Bruyer, R., C. Laterre, X. Seron, P. Feyereisen, E. Strypstein, E. Pierrard and D. Rectem (1983). “A case of prosopagnosia with some preserved covert remembrance of familiar faces.” Brain and Cognition 2: 257–284.
Bub, U., M. Hunke and A. Waibel (1995). Knowing who to listen to in speech recognition: Visually guided beamforming. International Conference on Acoustics, Speech and Signal Processing. Detroit, MI, pp. 848–851.
Buccino, G., F. Binkofski, G. R. Fink, L. Fadiga, L. Fogassi, V. Gallese, R. J. Seitz, K. Zilles, G. Rizzolatti and H. J. Freund (2001). “Action observation activates premotor and parietal areas in a somatotopic manner: A fMRI study.” European Journal of of Neuroscience 13: 400–404.
Buchaillard, S., P. Perrier and Y. Payan (2009). “A biomechanical model of cardinal vowel production: muscle activations and the impact of gravity on tongue positioning.” Journal of the Acoustical Society of America 126(4): 2033–2051.
Burnham, D. (1986). “Developmental loss of speech perception: Exposure to and experience with a first language.” Applied Psycholinguistics 7: 206–240.
Burnham, D. (1998). Language specificity in the development of auditory-visual speech perception. Hearing by Eye II: Advances in the Psychology of Speechreading and Auditory-visual Speech. R. Campbell, B. Dodd and D. Burnham. Hove, UK, Psychology Press: 27–60.
Burnham, D. (2000). Excavations in language development: Cross-linguistic studies of consonant and tone perception. International Conference on Human and Machine Processing of Language and Speech. Bangkok, Thailand, Chulalongkorn University Press.
Burnham, D. (2003). “Language specific speech perception and the onset of reading.” Reading and Writing: An Interdisciplinary Journal 16: 573–609.
Burnham, D., V. Ciocca, C. Lauw, S. Lau and S. Stokes (2000). Perception of visual information for Cantonese tones. Australian International Conference on Speech Science and Technology. Canberra, Australian Speech Science and Technology Association, pp. 86–91.
Burnham, D. and B. Dodd (1996). Auditory-visual speech perception as a direct process: The McGurk effect in infants and across languages. Speechreading by Humans and Machines. D. G. Stork and M. E. Hennecke. Berlin, Springer: 103–114.
Burnham, D. and B. Dodd (1998). “Familiarity and novelty in infant cross-language studies: Factors, problems, and a possible solution.” Advances in Infancy Research 12: 170–187.
Burnham, D. and B. Dodd (2004). “Auditory-visual speech integration by pre-linguistic infants: Perception of an emergent consonant in the McGurk effect.” Developmental Psychobiology 44: 204–220.
Burnham, D., E. Francis, D. Webster, S. Luksaneeyanawin, C. Attapaiboon, F. Lacerda and P. Keller (1996). Perception of lexical tone across languages: Evidence for a linguistic mode of processing. International Conference on Spoken Language Processing, pp. 2514–2517.
Burnham, D. and S. Lau (1998). The effect of tonal information on auditory reliance in the McGurk effect. International Conference on Auditory-Visual Speech Processing. Terrigal, Australia, pp. 37–42.
Burnham, D. and S. Lau (1999). The integration of auditory and visual speech information with foreign speakers: the role of expectancy. AVSP. Santa Cruz, CA, pp. 80–85.
Burnham, D., S. Lau, H. Tam and C. Schoknecht (2001). Visual discrimination of Cantonese tone by tonal but non-Cantonese speakers, and by non-tonal language speakers. International Conference on Auditory-Visual Speech Processing. Aalborg, Denmark, pp. 155–160.
Burnham, D. and K. Mattock (2010). Auditory Development. Handbook of Infant Development: 2nd Edition. Volume I: Basic Research. G. Bremner and T. D. Wachs. Blackwood, NJ, Wiley-Blackwell.
Burnham, D., M. Tyler and S. Horlyck (2002). Periods of speech perception development and their vestiges in adulthood. An Integrated View of Language Development: Papers in Honor of Henning Wode. P. Burmeister, T. Piske and A. Rohde. Trier, Germany, Wissenschaftlicher Verlag Trier.
Burt, P. and E. H. Adelson (1983a). “A multiresolution spline with application to image mosaics.” ACM Transactions on Graphics 2(4): 217–236.
Burt, P. J. and E. H. Adelson (1983b). “The Laplacian pyramid as a compact image code.” IEEE Transactions on Communications 31(4): 532–540.
Burton, A. M., V. Bruce and P. J. B. Hancock (1999). “From pixels to people: a model of familiar face recognition.” Cognitive Science 23: 1–31.
Burton, A. M., V. Bruce and R. A. Johnston (1990). “Understanding face recognition with an interactive activation model.” British Journal of Psychology 81: 361–380.
Burton, A. M., S. Wilson, M. Cowan and V. Bruce (1999). “Face recognition in poor quality video: evidence from security surveillance.” Psychological Sciences 10: 243–248.
Butcher, N. (2009). Investigating the dynamic advantage for same and other-race faces. Unpublished PhD Thesis, University of Manchester.
Cabeza, R. and A. Kingstone (2006). Handbook of Functional Neuroimaging of Cognition. Cambridge, MA, MIT Press
Callan, D. E., A. Callan, E. Kroos and E. Vatikiotis-Bateson (2001). “Multimodal contributions to speech perception revealed by independent component analysis: a single sweep EEG study.” Cognitive Brain Research 10: 349–353.
Callan, D. E., J. A. Jones, K. Munhall, A. M. Callan, C. Kroos and E. Vatikiotis-Bateson (2003). “Neural processes underlying perceptual enhancement by visual speech gestures.” Neuroreport 14(17): 2213–2218.
Callan, D. E., J. A. Jones, K. G. Munhall, C. Kroos, A. Callan and E. Vatikiotis-Bateson (2004). “Multisensory-integration sites identified by perception of spatial wavelet filtered visual speech gesture information.” Journal of Cognitive Neuroscience 16: 805–816.
Calvert, G. A., M. J. Brammer, E. T. Bullmore, R. Campbell, S. D. Iversen and A. S. David (1999). “Response amplification in sensory-specific cortices during cross-modal binding.” Neuroreport 10: 2619–2623.
Calvert, G. A., E. T. Bullmore, M. J. Brammer, R. Campbell, S. C. Williams, P. K. McGuire, P. W. Woodruff, S. D. Iversen and A. S. David (1997). “Activation of auditory cortex during silent lipreading.” Science 276: 593–596.
Calvert, G. A. and R. Campbell (2003). “Reading speech from still and moving faces: The neural substrates of visible speech.” Journal of Cognitive Neuroscience 15: 57–70.
Calvert, G. A., R. Campbell and M. J. Brammer (2000). “Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex.” Current Biology 10: 649–657.
Calvert, G. A., C. Spence and B. E. Stein (2004). The Handbook of Multisensory Processes. Cambridge, MA, MIT Press.
Calvert, G. A. and T. Thesen (2004). “Multisensory Integration: methodological approaches and emerging principles in the human brain.” Journal of Physiology 98(1–3): 191–205.
Campbell, R. (1986). “The lateralisation of lip-reading: A first look.” Brain and Cognition 5: 1–21.
Campbell, R. (1996). Seeing brains reading speech: A review and speculations. Speechreading by Humans and Machines. D. Stork and M. Hennecke. Berlin, Springer: 115–133.
Campbell, R. (2008). “The processing of audiovisual speech.” Philosophical Transactions of the Royal Society of London B 363: 1001–1010.
Campbell, R. (2011). Speechreading: what’s missing? The Oxford Handbook of Face Processing. A. Calder, G. Rhodes and J. Haxby. Oxford University Press: 605–630
Campbell, R., B. Brooks, E. H. F. De Haan and T. Roberts (1996). “Dissociating face processing skills – decisions about lip-read speech, expression and identity.” Quarterly Journal of Experimental Psychology, Section A: Human Experimental Psychology 49(2): 295–314.
Campbell, R. and B. Dodd (1980). “Hearing by eye.” Quarterly Journal of Experimental Psychology 32: 85–99.
Campbell, R., B. Dodd and D. Burnham (1998). Hearing by Eye II. The Psychology of Speechreading and Auditory-Visual Speech. Hove, UK, Psychology Press Ltd.
Campbell, R., T. Landis and M. Regard (1986). “Face recognition and lipreading: a neurological dissociation.” Brain 109: 509–521.
Campbell, R., M. MacSweeney, S. Surguladze, G. Calvert, P. K. McGuire, J. Suckling, M. J. Brammer and A. S. David (2001). “Cortical substrates for the perception of face actions: An fMRI study of the specificity of activation for seen speech and for meaningless lower face acts.” Cognitive Brain Research 12: 233–243.
Campbell, R., J. Zihl, D. Massaro, K. Munhall and M. M. Cohen (1997). “Speechreading in the akinetopsic patient, L.M.” Brain 120: 1793–1803.
Cao, Y., P. Faloutsos, E. Kohler and F. Pighin (2004). Real-time speech motion Synthesis from recorded motions. ACM Siggraph/Eurographics Symposium on Computer Animation. Grenoble, France, pp. 345–353.
Capek, C. M., D. Bavelier, D. Corina, A. J. Newman, P. Jezzard and H. J. Neville (2004). “The cortical organization of audio-visual sentence comprehension: an fMRI study at 4 Tesla.” Brain Research: Cognitive Brain Research 20: 111–119.
Capek, C. M., R. Campbell, M. MacSweeney, B. Woll, M. Seal, D. Waters, A. S. Davis, P. K. McGuire and M. J. Brammer (2005). The organization of speechreading as a function of attention. Cognitive Neuroscience Society Annual Meeting, poster presentation.
Capek, C. M., M. Macsweeney, B. Woll, D. Waters, P. K. McGuire, A. S. David, M. J. Brammer and R. Campbell (2008a). “Cortical circuits for silent speechreading in deaf and hearing people.” Neuropsychologia 46(5): 1233–1241.
Capek, C. M., D. Waters, B. Woll, M. MacSweeney, M. J. Brammer, P. K. McGuire, A. S. David and R. Campbell (2008b). “Hand and mouth: cortical correlates of lexical processing in British Sign Language and speechreading English.” Journal of Cognitive Neuroscience 20(7): 1220–1234.
Capek, C. M., B. Woll, M. MacSweeney, D. Waters, P. K. McGuire, A. S. David, M. J. Brammer and R. Campbell (2010). “Superior temporal activation as a function of linguistic knowledge: insights from deaf native signers who speechread.” Brain and Language 112(2): 129–134.
Carlson, R. and B. Granström (1976). “A text-to-speech system based entirely on rules.” IEEE International Conference on Acoustics, Speech, and Signal Processing: 686–688.
Cassell, J., J. Sullivan, S. Prevost and E. Churchill (2000). Embodied Conversational Agents. Cambridge, MA, MIT Press.
Cassell, J. and A. Tartaro (2007). “Intersubjectivity in human-agent interaction.” Interaction Studies 8(3): 391–410.
Catford, J. C. (1977). Fundamental Problems in Phonetics. Indiana University Press.
Cathiard, M.-A. (1994). La perception visuelle de l’anticipation des gestes vocaliques: cohérence des évènements audibles et visibles dans le flux de la parole. Doctorat de Psychologie Cognitive, Université Pierre Mendès France, Grenoble.
Cathiard, M.-A. and C. Abry (2007). Speech structure decisions from speech motion coordina-tions. International Congress of Phonetic Sciences. Saarbrücken, Germany, pp. 291–296.
Cathiard, M.-A., C. Abry and M.-T. Lallouache (1996). Does movement on the lips mean movement in the mind? Speechreading by Humans and Machines. D. G. Stork and M. E. Hennecke. Berlin, Springer: 211–219.
Cathiard, M.-A., C. Abry and J.-L. Schwartz (1998). Visual perception of glides versus vowels. International Conference on Auditory-visual Speech Processing. Terrigal, Australia, pp. 115–120.
Cathiard, M.-A., V. Attina and D. Alloatti (2003). Labial anticipation behavior during speech with and without Cued Speech. International Conference of Phonetic Sciences. Barcelona, Spain, pp. 1939–1942.
Cathiard, M.-A., F. Bouaouni, V. Attina and D. Beautemps (2004). Etude perceptive du décours de l’information manuo-faciale en Langue Française Parlée Complétée. Journées d’Etudes de la Parole. Fez, Morocco, pp. 113–116.
Cathiard, M.-A., T. Lallouache, T. Mohamadi and C. Abry (1995). Configurational vs. temporal coherence in audiovisual speech perception. International Congress of Phonetic Sciences. Stockholm, Sweden, pp. 218–221.
Cathiard, M.-A., J.-L. Schwartz and C. Abry (2001). Asking a naive question to the McGurk effect : why does audio [b] give more [d] percepts with visual [g] than with visual [d]?International Auditory-Visual Speech Processing Workshop. Scheelsminde, Denmark, pp. 138–142.
Chabanas, M. and Y. Payan (2000). A 3D Finite Element model of the face for simulation in plastic and maxillo-facial surgery. International Conference on Medical Image Computing and Computer-Assisted Interventions. Pittsburgh, USA, pp. 1068–1075.
Chan, M. T. (2001). HMM based audio-visual speech recognition integrating geometric- and appearance-based visual features. Workshop on Multimedia Signal Processing. Cannes, France, pp. 9–14.
Chan, M. T., Y. Zhang and T. S. Huang (1998). Real-time lip tracking and bimodal continuous speech recognition. Workshop on Multimedia Signal Processing. Redondo Beach, CA, pp. 65–70.
Chandler, J. P. (1969). “Subroutine STEPIT – Finds local minima of a smooth function of several parameters.” Behavioral Science 14: 81–82.
Chandramohan, D. and P. L. Silsbee (1996). A multiple deformable template approach for visual speech recognition. International Conference on Spoken Language Processing. Philadelphia, PA, pp. 50–53.
Chandrasekaran, C., A. Trubanova, S. Stillittano, A. Caplier and A. A. Ghazanfar (2009). “The natural statistics of audiovisual speech.” PLoS Computational Biology 5(7): e1000436.
Charlier, B. L. and J. Leybaert (2000). “The rhyming skill of children educated with phonetically augmented speechreading.” Quarterly Journal of Experimental Psychology A, 53: 349–375.
Charpentier, F. and E. Moulines (1990). “Pitch-synchronous waveform processing techniques for text-to-speech using diphones.” Speech Communication 9(5–6): 453–467.
Chatfield, C. and A. J. Collins (1991). Introduction to Multivariate Analysis. London, Chapman and Hall.
Chaudhari, U. V., G. N. Ramaswamy, G. Potamianos and C. Neti (2003). Information fusion and decision cascading for audio-visual speaker recognition based on time-varying stream reliability prediction. International Conference on Multimedia and Expo. Baltimore, MD, pp. 9–12.
Chen, S. E. and L. Williams (1993). View interpolation for image synthesis. SIGGRAPH. Anaheim, CA, pp. 279–288.
Chen, T. (2001). “Audiovisual speech processing. Lip reading and lip synchronization.” IEEE Signal Processing Magazine 18(1): 9–21.
Chen, T., H. P. Graf and K. Wang (1995). “Lip synchronization using speech-assisted video processing.” IEEE Signal Processing Letters 2(4): 57–59.
Chen, T. and D. W. Massaro (2004). “Mandarin speech perception by ear and eye follows a universal principle.” Perception & Psychophysics 66: 820–836.
Chen, T. and R. R. Rao (1998). “Audio-visual integration in multimodal communication.” Proceedings of the IEEE 86(5): 837–851.
Chibelushi, C. C., F. Deravi and J. S. D. Mason (1996). Survey of audio visual speech databases. Swansea, United Kingdom, Department of Electrical and Electronic Engineering, University of Wales.
Chibelushi, C. C., F. Deravi and J. S. D. Mason (2002). “A review of speech-based bimodal recognition.” IEEE Transactions on Multimedia 4(1): 23–37.
Chiou, G. and J.-N. Hwang (1997). “Lipreading from color video.” IEEE Transactions on Image Processing 6(8): 1192–1195.
Chitoran, I. and A. Nevins (2008). “Studies on the phonetics and phonology of glides.” Lingua: Special issue 118(12).
Choi, K. H., Y. Luo and J.-N. Hwang (2001). “Hidden Markov model inversion for audio-to-visual conversion in an MPEG-4 facial animation system.” Journal of VLSI Signal Processing – Systems for Signal, Image, and Video Technology 29(1–2): 51–62.
Chou, W., B.-H. Juang, C.-H. Lee and F. Soong (1994). “A minimum error rate pattern recognition approach to speech recognition.” International Journal of Pattern Recognition and Artificial Intelligence 8(1): 5–31.
Christie, F. and V. Bruce (1998). “The role of dynamic information in the recognition of unfamiliar faces.” Memory and Cognition 26(4): 780–790.
Chu, S. and T. Huang (2000). Bimodal speech recognition using coupled hidden Markov models. International Conference on Spoken Language Processing. Beijing, China, pp. 747–750.
Chu, S. M. and T. S. Huang (2002). Audio-visual speech modeling using coupled hidden Markov models. International Conference on Acoustics, Speech and Signal Processing. Orlando, FL, pp. 2009–2012.
Chuang, E. and C. Bregler (2005). “Mood swings: expressive speech animations.” ACM Transactions on Graphics 24(2): 331–347.
Cisař, P., M. Železný, Z. Krňoul, J. Kanis, J. Zelinka and L. Müller (2005). Design and recording of Czech speech corpus for audio-visual continuous speech recognition. International Conference on Auditory-Visual Speech Processing. Vancouver Island, Canada, pp. 93–96.
Code, C. (2005). Syllables in the brain: evidence from brain damage. Phonological Encoding and Monitoring in Normal and Pathological Speech. R. J. Hartsuiker, R. Bastiaanse, A. Postma and F. N. K. Wijnen. Hove, Sussex, Psychology Press.
Cohen, I., N. Sebe, A. Garg, L. S. Chen and T. S. Huang (2003). “Facial expression recognition from video sequences: temporal and static modeling.” Computer Vision and Image Understanding 91(1–2): 160–187.
Cohen, M. M., J. Beskow and D. W. Massaro (1998). Recent developments in facial animation: an inside view. Auditory-Visual Speech Processing Workshop (AVSP). Terrigal, Sydney, Australia, pp. 201–206.
Cohen, M. M. and D. W. Massaro (1990). “Synthesis of visible speech.” Behavior Research Methods: Instruments & Computers 22: 260–263.
Cohen, M. M. and D. W. Massaro (1993). Modeling coarticulation in synthetic visual speech. Models and Techniques in Computer Animation. D. Thalmann and N. Magnenat-Thalmann. Tokyo, Springer: 141–155.
Cohen, M. M. and D. W. Massaro (1994a). “Development and experimentation with synthetic visible speech.” Behavioral Research Methods and Instrumentation 26: 260–265.
Cohen, M. M. and D. W. Massaro (1994b). What can visual speech synthesis tell visual speech recognition? Asilomar Conference on Signals, Systems, and Computers. Pacific Grove, CA.
Cohen, M. M., D. W. Massaro and R. Clark (2002). Training a talking head. Fourth International Conference on Multimodal Interfaces. Pittsburgh, Pennsylvania.
Cohen, M. M., R. L. Walker and D. W. Massaro (1995). Perception of synthetic visual speech. Speechreading by Man and Machine:Models, Systems and Applications. Chateau de Bonas, France, NATO Advanced Study Institute 940584.
Cohen, M. M., R. L. Walker and D. W. Massaro (1996). Perception of synthetic visual speech. Speechreading by Humans and Machines. D. G. Stork and M. E. Hennecke. New York, Springer: 153–168.
Coker, C. H. (1968). Speech synthesis with a parametric articulatory model. Reprinted in Speech Synthesis. J. L. Flanagan and L. R. Rabiner. Stroudsburg, PA, Dowden, Hutchinson and Ross: 135–139.
Cole, R., D. W. Massaro, J. de Villiers, B. Rundle, K. Shobaki, J. Wouters, M. Cohen, J. Beskow, P. Stone, P. Connors, A. Tarachow and D. Solcher (1999). New tools for interactive speech and language training: Using animated conversational agents in the classrooms of profoundly deaf children. ESCA/SOCRATES Workshop on Method and Tool Innovations for Speech Science Education. London, UK.
Cole, R. A., R. Stern, M. S. Phillips, S. M. Brill, P. A. P. and P. Specker (1983). Feature-based speaker independent recognition of English letters. International Conference on Acoustics, Speech, and Signal Processing, pp. 731–734.
Colin, C., M. Radeau, A. Soquet, D. Demolin, F. Colin and P. Deltenre (2002). “Mismatch negativity evoked by the McGurk-MacDonald effect: a phonetic representation within short-term memory.” Clinical Neurophysiology 113: 495–506.
Connell, J. H., N. Haas, E. Marcheret, C. Neti, G. Potamianos and S. Velipasalar (2003). A real-time prototype for small-vocabulary audio-visual ASR. International Conference on Multimedia and Expo. Baltimore, MD, pp. 469–472.
Connine, C. M., D. G. Blasko and J. Wang (1994). “Vertical similarity in spoken word recognition: Multiple lexical activation, individual differences, and the role of sentence context.” Perception & Psychophysics 56: 624–636.
Cooke, M., J. Barker, S. Cunningham and X. Shao (2006). “An audio-visual corpus for speech perception and automatic speech recognition.” Journal of the Acoustical Society of America 120(5): 2421–2424.
Cooper, F. S., P. C. Delattre, A. M. Liberman, J. M. Borst and L. J. Gerstman (1952). “Some experiments on the perception of synthetic speech sounds.” Journal of the Acoustical Society of America 24: 597–606.
Cootes, T. F., G. J. Edwards and C. J. Taylor (1998). Active appearance models. European Conference on Computer Vision. Freiburg, Germany, pp. 484–498.
Cootes, T. F., C. J. Taylor, D. H. Cooper and J. Graham (1995). “Active shape models – their training and application.” Computer Vision and Image Understanding 61(1): 38–59.
Cormen, T. H., C. E. Leiserson, R. L. Rivest and C. Stein (1989). Introduction to Algorithms. Boston, MA, The MIT Press and McGraw-Hill Book Company.
Cornett, R. O. (1967). “Cued Speech.” American Annals of the Deaf 112: 3–13.
Cornett, R. O. (1982). Le Cued Speech. Aides manuelles à la lecture labiale et perspectives d’aides automatiques. F. Destombes. Paris, Centre scientifique IBM-France.
Cornett, R. O. (1988). “Cued Speech, manual complement to lipreading, for visual reception of spoken language.” Principles, practice and prospects for automation. Acta Oto-Rhino-Laryngologica Belgica 42(3): 375–384.
Cornett, R. O. and M. E. Daisey (1992). The cued speech resource book for parents of deaf children. Raleigh, NC, The National Cued Speech Association, Inc.
Cosatto, E. and H. P. Graf (1998). Sample-based synthesis of photo-realistic talking heads. Computer Animation. Philadelphia, PA, pp. 103–110.
Cosatto, E., G. Potamianos and H. P. Graf (2000). Audio-visual unit selection for the synthesis of photo-realistic talking-heads. International Conference on Multimedia and Expo. New York, NY, pp. 1097–1100.
Cosi, P., E. Caldognetto, G. Perin and C. Zmarich (2002a). Labial coarticulation modeling for realistic facial animation. IEEE International Conference on Multimodal Interfaces. Pittsburgh, PA.
Cosi, P., M. M. Cohen and D. W. Massaro (2002b). Baldini: Baldi speaks Italian. Inter-national Conference on Spoken Language Processing. Denver, CO, pp. 2349–2352.
Cosker, D., D. Marshall, P. Rosin, S. Paddock and S. Rushton (2005). “Towards perceptually realistic talking heads: models, metrics and McGurk.” ACM Transactions on Applied Perception (TAP) 2(3): 270–285.
Couteau, B., Y. Payan and S. Lavallée (2000). “The Mesh-Matching algorithm : an automatic 3D mesh generator for finite element structures.” Journal of Biomechanics 33(8): 1005–1009.
Covell, M. and C. Bregler (1996). Eigenpoints. International Conference on Image Processing. Lausanne, Switzerland, pp. 471–474.
Cox, S., I. Matthews and A. Bangham (1997). Combining noise compensation with visual information in speech recognition. European Tutorial Workshop on Audio-Visual Speech Processing. Rhodes, Greece, pp. 53–56.
Crawford, R. (1995). “Teaching voiced velar stops to profoundly deaf children using EPG, two case studies.” Clinical Linguistics and Phonetics 9: 255–270.
Czap, L. (2000). Lip representation by image ellipse. International Conference on Spoken Language Processing. Beijing, China, pp. 93–96.
Dagenais, P. A., P. Critz-Crosby, S. G. Fletcher and M. J. McCutcheon (1994). “Comparing abilities of children with profound hearing impairments to learn consonants using electropalatography or traditional aural-oral techniques.” Journal of Speech and Hearing Research 37: 687–699.
Dalton, B., R. Kaucic and A. Blake (1996). Automatic speechreading using dynamic contours. Speechreading by Humans and Machines. D. G. Stork and M. E. Hennecke. Berlin, Springer: 373–382.
Damasio, A. R., D. Tranel and H. Damasio (1990). “Varieties of face agnosia – Different neuropsychological profiles and neural substrates.” Journal of Clinical and Experimental Neuropsychology 12(1): 675–687.
Damasio, H. and A. R. Damasio (1989). Lesion Analysis in Neuropsychology. Oxford University Press.
Dang, J. and K. Honda (1998). Speech production of vowel sequences using a physiological articulatory model. International Conference on Spoken Language Processing. Sydney, Australia, pp. 1767–1770.
Dang, J. and K. Honda (2004). “Construction and control of a physiological articulatory model.” Journal of the Acoustical Society of America 115(2): 853–870.
Daubechies, I. (1992). Wavelets. Philadelphia, PA, S.I.A.M.
Davis, B. L. and P. F. MacNeilage (1995). “The articulatory basis of babbling.” Journal of Speech and Hearing Research 38: 1199–1211.
Davis, B. L. and P. F. MacNeilage (2000). “An embodiment perspective on the acquisition of speech perception.” Phonetica 57(2–4): 229–241.
Davoine, F., H. Li and R. Forchheimer (1997). Video compression and person authentication. Audio- and Video-based Biometric Person Authentication. J. Bigün, G. Chollet and G. Borgefors. Berlin, Springer: 353–360.
De Cuetos, P., C. Neti and A. Senior (2000). Audio-visual intent to speak detection for human computer interaction. International Conference on Acoustics, Speech and Signal Processing. Istanbul, Turkey, pp. 1325–1328.
de Gelder, B., P. Bertelson, J. Vroomen and H. C. Chen (1995). Interlanguage differences in the McGurk effect for Dutch and Cantonese listeners. European Conference on Speech Communication and Technology. Madrid, Spain, pp. 1699–1702.
de Gelder, B. and J. Vroomen (1992). Auditory and visual speech perception in alphabetic and nonalphabetic Chinese/Dutch bilinguals. Cognitive Processing in Bilinguals. R. J. Harris. Amsterdam, Elsevier: 413–426.
de Gelder, B., J. Vroomen and L. van der Heide (1991). “Face recognition and lipreading in autism.” European Journal of Cognitive Psychology 3: 69–86.
de Haan, E. H. F., A. Young and F. Newcombe (1987). “Face recognition without awareness.” Cognitive Neuropsychology 4: 385–415.
De la Torre, A., A. M. Peinado, J. C. Segura, J. L. Perez-Cordoba, M. C. Benítez and A. J. Rubio (2005). “Histogram equalization of speech representation for robust speech recognition.” IEEE Transactions on Speech and Audio Processing 13(3): 355–366.
Decety, J., D. Perani, M. Jeannerod, V. Bettinardi, B. Tadary, R. Woods, J. C. Mazziotta and F. Fazio (1994). “Mapping motor representations with positron emission tomography.” Nature 371: 600–602.
Delattre, P. C., A. M. Liberman and F. S. Cooper (1955). “Acoustic loci and transitional cues for consonants.” Journal of the Acoustical Society of America 27: 769–773.
Deligne, S., G. Potamianos and C. Neti (2002). Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization). International Conference on Spoken Language Processing. Denver, CO, pp. 1449–1452.
Deller Jr., J. R., J. G. Proakis and J. H. L. Hansen (1993). Discrete-Time Processing of Speech Signals. Englewood Cliffs, NJ, Macmillan Publishing Company.
Demorest, M. E., L. E. Bernstein and G. P. DeHaven (1996). “Generalizability of speechreading performance on nonsense syllables, words, and sentences: Subjects with normal hearing.” Journal of Speech and Hearing Research 39(4): 697–713.
Dempster, A. P., N. M. Laird and D. B. Rubin (1977). “Maximum likelihood from incomplete data via the EM algorithm.” Journal of the Royal Statistical Society 39(1): 1–38.
Dent, H., F. Gibbon and W. Hardcastle (1992). Inhibiting an abnormal lingual pattern in a cleft palate child using electropalatography (EPG). Interdisciplinary Perspectives in Speech and Language Pathology. M. M. Leahy and J. L. Kallen. Dublin, Trinity College: 211–221.
Dent, H., F. Gibbon and W. Hardcastle (1995). “The application of electropalatography (EPG) to the remediation of speech disorders in school-aged children and young adults.” European Journal of Disorders of Communication 30: 264–277.
dePaula, H., H. C. Yehia, D. Shiller, G. Jozan, K. G. Munhall and E. Vatikiotis-Bateson (2006). Analysis of audiovisual speech intelligibility based on spatial and temporal filtering of visible speech information. Speech Production: Models, Phonetic Processes, and Techniques. J. Harrington and M. Tabain. Hove, Psychology Press: 135–147.
Desjardins, R. N., J. Rogers and J. F. Werker (1997). “An exploration of why preschoolers perform differently than do adults in audiovisual speech perception tasks.” Journal of Experimental Child Psychology 66: 85–110.
Desjardins, R. N. and J. F. Werker (1996). 4-month-old female infants are influenced by visible speech. International Conference on Infant Studies. Providence, RI.
Desjardins, R. N. and J. F. Werker (2004). “Is the integration of heard and seen speech mandatory for infants?” Developmental Psychobiology 45: 187–203.
Dittmann, A. T. (1972). “Developmental factors in conversational behaviour.” Journal of Communication 22: 404–423.
Dittrich, W. H., T. Troscianko, S. E. G. Lea and D. Morgan (1996). “Perception of emotion from dynamic point-light displays represented in dance.” Perception 25: 727–738.
Dodd, B. (1979). “Lipreading in infants: Attention to speech presented in and out of synchrony.” Cognitive Psychology 11: 478–484.
Dodd, B. and D. K. Burnham (1988). “Processing speechread information.” The Volta Review: New Reflections on Speechreading 90: 45–60.
Dodd, B. and R. Campbell (1987). Hearing by Eye: The Psychology of Lipreading. London, Erlbaum.
Dodd, B., B. McIntosh and L. Woodhouse (1998). Early lipreading ability and speech and language development of hearing-impaired pre-schoolers. Hearing by Eye II: Advances in the Psychology of Speechreading and Auditory-visual Speech. R. Campbell, B. Dodd and D. Burnham. Hove, UK, Psychology Press: 229–242.
Dohen, M., J.-L. Schwartz and G. Bailly (2010). “Speech and Face-to-face Communication – An Introduction.” Speech Communication 52: 477–480.
Dougherty, E. R. and C. R. Giardina (1987). Image Processing – Continuous to Discrete. Vol. I: Geometric, Transform, and Statistical Methods. Englewood Cliffs, NJ, Prentice Hall.
Driver, J. (1996). “Enhancement of selective listening by illusory mislocation of speech sounds due to lip-reading.” Nature 381: 66–68.
Droppo, J. and A. Acero (2008). Environmental robustness. Springer Handbook of Speech Processing. J. Benesty, M. M. Sondhi and Y. Huang. Berlin, Springer: 653–679.
Dryden, I. L. and K. V. Mardia (1998). Statistical Shape Analysis. London, John Wiley and Sons.
Dubeau, M. C., M. Iacoboni, L. Koski, J. Markovac and J. C. Mazziotta (2002). Topography for body-parts motion in the posterior STS region. Perspectives on Imitation: from cognitive neuroscience to social science. Royaumont Abbey, France.
Duchenne, G. B. (1990). The Mechanism of Human Facial Expression. Cambridge University Press.
Duchnowski, P., D. S. Lum, J. C. Krause, M. G. Sexton, M. S. Bratakos and L. D. Braida (2000). “Development of speechreading supplements based on automatic speech recognition.” IEEE Transactions on Biomedical Engineering 47(4): 487–496.
Duchnowski, P., U. Meier and A. Waibel (1994). See me, hear me: Integrating automatic speech recognition and lip-reading. International Conference on Spoken Language Processing. Yokohama, Japan, pp. 547–550.
Dupont, S. and J. Luettin (2000). “Audio-visual speech modeling for continuous speech recognition.” IEEE Transactions on Multimedia 2(3): 141–151.
Duraffour, A. (1969). Glossaire des patois franco-provençaux. Paris, Editions du CNRS.
Echternach, M., J. Sundberg, S. Arndt, T. Breyer, M. Markl, M. Schumacher and B. Richter (2008). “Vocal tract and register changes analysed by real-time MRI in male professional singers – a pilot study.” Logopedics Phoniatrics Vocology 33(2): 67–73.
Edwards, J., F. Gibbon and M. Fourakis (1997). “On discrete changes in the acquisition of the alveolar/velar stop contrast.” Language and Speech 40.
Edwards, K. (1998). “The face of time: Temporal cues in facial expression of emotion.” Psychological Science 9(4): 270–276.
Efros, A. A. and T. K. Leung (1999). Texture synthesis by non-parametric sampling. IEEE International Conference on Computer Vision, pp. 1033–1038.
Eisert, P. and B. Girod (1998). “Analyzing facial expressions for virtual conferencing.” IEEE Computer Graphics & Applications: Special Issue: Computer Animation for Virtual Humans 18(5): 70–78.
Ekman, P. (1979). About brows: emotional and conversational signals. Human Ethology. M. von Cranach, K. Foppa, W. Lepenies and D. Ploog. Cambridge University Press: 169–249.
Ekman, P. (1982). Emotion and the Human Face. Cambridge University Press.
Ekman, P. and W. Friesen (1978). Facial Action Coding System (FACS): A technique for the measurement of facial action. Palo Alto, CA, Consulting Psychologists Press.
Ekman, P. and W. Friesen (2003). Facial Action Coding System. Palo Alto, CA, Consulting Psychologists Press Inc.
Ekman, P. and W. V. Friesen (1982). “Felt, false, and miserable smiles.” Journal of Nonverbal Behavior 6(4): 238–252.
Ekman, P., W. V. Friesen and R. C. Simons (1985). “Is the startle reaction an emotion?” Journal of Personality and Social Psychology 49(5): 1416–1426.
Ekman, P. and H. Oster (1979). “Facial expressions of emotion.” Annual Review of Psychology 30: 527–554.
Elisei, F., G. Bailly, M. Odisio and P. Badin (2001a). Clones parlants vidéo-réalistes: application à l’interprétation de FAP-MPEG4. CORESA. Dijon, France, pp. 145–148.