Speaker diarization

doi:10.1017/CBO9781139136310.004

4 - Speaker diarization

Published online by Cambridge University Press: 05 July 2012

Fabio Valente and Gerald Friedland

Edited by Steve Renals, Hervé Bourlard, Jean Carletta and Andrei Popescu-Belis

Affiliations:
Fabio Valente: Idiap Research Institute, Martigny, Switzerland
Gerald Friedland: International Computer Science Institute, Berkeley
Steve Renals: University of Edinburgh
Hervé Bourlard: Idiap Research Institute
Jean Carletta: University of Edinburgh
Andrei Popescu-Belis: Idiap Research Institute, Martigny, Switzerland


Introduction

Segmenting multi-party conversations into homogeneous speaker regions is a fundamental step towards automatic understanding of meetings. This information serves several purposes: adaptation for speaker and speech recognition, metadata extraction for navigating meetings, and input for automatic interaction analysis.

This task is referred to as speaker diarization. It aims at inferring “who spoke when” in an audio stream and involves two simultaneous goals: (1) estimating the number of speakers in the audio stream and (2) associating each speech segment with a speaker.
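As an illustration of what a “who spoke when” result might look like, the following minimal sketch represents a diarization hypothesis as a list of labeled time segments. The `Segment` type, the speaker labels, and the timings are hypothetical and are not taken from the chapter; they only make the two goals concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One speech region attributed to one speaker (times in seconds)."""
    start: float
    end: float
    speaker: str


def count_speakers(segments):
    """Goal (1): the number of distinct speakers in the hypothesis."""
    return len({s.speaker for s in segments})


def who_spoke_at(segments, t):
    """Goal (2): the speakers active at time t (several if speech overlaps)."""
    return sorted({s.speaker for s in segments if s.start <= t < s.end})


# Hypothetical output for a short recording with one overlapping region.
hypothesis = [
    Segment(0.0, 4.2, "spk_a"),
    Segment(3.8, 9.5, "spk_b"),
    Segment(9.5, 12.0, "spk_a"),
]

n_speakers = count_speakers(hypothesis)
active_during_overlap = who_spoke_at(hypothesis, 4.0)
```

Allowing segments to overlap in time matters for meeting data, where several participants may speak at once.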

Diarization algorithms have been developed extensively for broadcast data, characterized by regular speaker turns, prompted speech, and high-quality audio, while processing meeting recordings presents different needs and additional challenges. On the one hand, the conversational nature of the speech involves very short turns and large amounts of overlapping speech; on the other hand, the audio is acquired in a nonintrusive way using far-field microphones and is thus corrupted by ambient noise and reverberation. Furthermore, real-time and online processing are often required so that applications can be used while the meeting is in progress. The next section briefly reviews the state of the art in the field.

State of the art in speaker diarization

Conventional speaker diarization systems are composed of the following steps: a feature extraction module that extracts acoustic features such as mel-frequency cepstral coefficients (MFCCs) from the audio stream; a speech/non-speech detection module that keeps only the speech regions and discards silence; an optional speaker change detection module that divides the input stream into small homogeneous segments, each uttered by a single speaker; and an agglomerative hierarchical clustering step that groups speech segments uttered by the same speaker into a single cluster.
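The final clustering step can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the method of any particular system: each segment is assumed to be already summarized by a single feature vector (standing in for statistics of its MFCC frames), clusters are compared by the Euclidean distance between their centroids, and merging stops at a hypothetical `stop_distance` threshold. The excerpt does not specify the distance measure or the stopping criterion, so both are placeholders.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def agglomerative_cluster(segment_features, stop_distance):
    """Bottom-up clustering of per-segment feature vectors.

    segment_features: list of equal-length numeric vectors, one per segment.
    stop_distance: hypothetical threshold; merging stops once the closest
        pair of clusters is farther apart than this value.
    Returns one integer cluster label per segment.
    """
    # Start with every segment in its own cluster.
    clusters = [[i] for i in range(len(segment_features))]
    while len(clusters) > 1:
        centroids = [
            centroid([segment_features[m] for m in members])
            for members in clusters
        ]
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroids[i], centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > stop_distance:
            break  # remaining clusters are treated as distinct speakers
        # Merge cluster j into cluster i (j > i, so index i stays valid).
        clusters[i].extend(clusters[j])
        del clusters[j]
    labels = [0] * len(segment_features)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels


# Toy 2-D vectors: segments 0, 1, 4 are close together, as are 2 and 3.
toy_features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1]]
labels = agglomerative_cluster(toy_features, stop_distance=1.0)
```

In this sketch the number of clusters left when merging stops serves as the estimate of the number of speakers, so the choice of stopping criterion directly affects that estimate.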

Type: Chapter
Information: Multimodal Signal Processing: Human Interactions in Meetings, pp. 40-55

DOI: https://doi.org/10.1017/CBO9781139136310.004

Publisher: Cambridge University Press

Print publication year: 2012
