Speech recognition

Ian Vince McLoughlin

doi:10.1017/CBO9781316084205.011

Having considered big data in the previous chapter, we now turn our attention to speech recognition, probably the one area of speech research that has gained the most from machine learning techniques. In fact, as discussed in the introduction to Chapter 8, it was only through the application of well-trained machine learning methods that automatic speech recognition (ASR) technology was able to advance beyond a decades-long plateau that limited performance, and hence the spread of further applications.

What is speech recognition?

Entire texts have been written on the subject of speech recognition, and this topic alone probably accounts for more than half of the recent research literature and computational development effort in the fields of speech and audio processing. There are good reasons for this interest, primarily driven by the wish to be able to communicate more naturally with a computer (i.e. without the use of a keyboard and mouse). This wish has been around for almost as long as electronic computers themselves. From a historical perspective, we might identify a hierarchy of mainstream human–computer interaction steps as follows:

Hardwired: The computer designer (i.e. engineer) ‘reprograms’ a computer, and provides input by reconnecting wires and circuits.

Card: Punched cards are used as input, printed tape as output.

Paper: Teletype input is used directly, and printed paper as output.

Alphanumeric: Electronic keyboards and monitors (visual display units), alphanumeric data.

Graphical: Mice and graphical displays enable the rise of graphical user interfaces (GUIs).

WIMP: Standardised methods of windows, icons, mouse and pointer (WIMP) interaction become predominant.

Touch: Touch-sensitive displays, particularly on smaller devices.

Speech commands: Nascent speech commands (such as voice dialling, voice commands, speech alerts), plus workable dictation capabilities and the ability to read back selected text.

Natural language: We speak to the computer much as we would to a person, and it responds similarly.

Anticipatory: The computer understands when we speak to it just like a close friend, husband or wife would, often anticipating what we will say, understanding the implied context as well as shared references or memories of past events.
