Published online by Cambridge University Press: 05 July 2012
Segmenting multi-party conversations into homogeneous speaker regions is a fundamental step towards automatic understanding of meetings. This information serves multiple purposes: adaptation for speaker and speech recognition, meta-data extraction for navigating meetings, and input for automatic interaction analysis.
This task is referred to as speaker diarization and aims at inferring "who spoke when" in an audio stream. It involves two simultaneous goals: (1) estimating the number of speakers in the audio stream and (2) associating each speech segment with a speaker.
Diarization algorithms have been developed extensively for broadcast data, characterized by regular speaker turns, prompted speech, and high-quality audio, whereas processing meeting recordings presents different needs and additional challenges. On the one hand, the conversational nature of the speech involves very short turns and large amounts of overlapping speech; on the other hand, the audio is acquired in a nonintrusive way using far-field microphones and is thus corrupted by ambient noise and reverberation. Furthermore, real-time and online processing are often required in order to enable many applications while the meeting is actually going on. The next section briefly reviews the state of the art in the field.
State of the art in speaker diarization
Conventional speaker diarization systems are composed of the following steps: a feature extraction module that extracts acoustic features such as mel-frequency cepstral coefficients (MFCCs) from the audio stream, a speech/non-speech detection module that retains only the speech regions and discards silence, an optional speaker change detection module that divides the input stream into short homogeneous segments each uttered by a single speaker, and an agglomerative hierarchical clustering step that groups segments belonging to the same speaker into a single cluster.
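The final clustering step above can be illustrated with a deliberately simplified sketch. It is not the authors' system: real diarization systems model each cluster with Gaussian mixtures and merge clusters using a model-based criterion such as the Bayesian Information Criterion (BIC), whereas this toy version represents each segment by the mean of its (already extracted) feature frames and merges the closest pair of clusters by Euclidean centroid distance until no pair is closer than a chosen threshold. The function name `diarize` and the fixed-length segmentation are illustrative assumptions, not part of any standard API.

```python
import math


def centroid(frames):
    # Mean feature vector of a list of per-frame feature vectors (e.g. MFCCs).
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def dist(a, b):
    # Euclidean distance between two centroids (stand-in for a BIC-style
    # merge criterion used in real systems).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diarize(frames, seg_len=10, threshold=1.0):
    """Toy agglomerative clustering over fixed-length segments.

    frames: list of per-frame feature vectors for the speech regions.
    Returns one cluster label per segment; segments sharing a label are
    hypothesised to come from the same speaker.
    """
    # 1. Divide the stream into short, assumed-homogeneous segments.
    segments = [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]
    labels = list(range(len(segments)))
    clusters = {i: list(seg) for i, seg in enumerate(segments)}

    # 2. Agglomerative merging: repeatedly fuse the closest pair of clusters.
    while len(clusters) > 1:
        cents = {lab: centroid(fs) for lab, fs in clusters.items()}
        keys = sorted(clusters)
        d, a, b = min((dist(cents[x], cents[y]), x, y)
                      for i, x in enumerate(keys) for y in keys[i + 1:])
        if d > threshold:
            break  # stopping criterion: no sufficiently similar pair remains
        clusters[a].extend(clusters.pop(b))
        labels = [a if lab == b else lab for lab in labels]
    return labels
```

With two well-separated synthetic "speakers" (frames near (0, 0) and frames near (5, 5)), the sketch merges the four ten-frame segments into two clusters, implicitly estimating the speaker count, which is the first of the two diarization goals described above.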