Statistical Natural Language Processing

M. Lothaire

doi:10.1017/CBO9781107341005.005

Introduction

The application of statistical methods to natural language processing has been remarkably successful over the past two decades. The wide availability of text and speech corpora has played a critical role in this success because, like all learning techniques, these methods rely heavily on data. Many of the components of complex natural language processing systems are statistical models derived from large data sets using modern learning techniques: for example, text normalizers, morphological or phonological analyzers, part-of-speech taggers, grammars or language models, pronunciation models, context-dependency models, and acoustic hidden Markov models (HMMs). These models are often given as weighted automata or weighted finite-state transducers, either directly or as the result of approximating more complex models.

Weighted automata and transducers are the finite automata and finite-state transducers described in Chapter 1, Section 1.5, with a weight added to each transition. Thus, weighted finite-state transducers are automata in which each transition, in addition to its usual input label, is augmented with an output label from a possibly different alphabet, and carries some weight. The weights may correspond to probabilities or log-likelihoods, or they may be some other costs used to rank alternatives. More generally, as we shall see in the next section, they are elements of a semiring. Transducers can be used to define a mapping between two different types of information sources, for example, word and phoneme sequences.
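The definition above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the chapter's own construction: the names Semiring, Transition, WeightedTransducer, transduce, TROPICAL and the toy machine are all hypothetical, the transducer has no epsilon transitions, and the tropical semiring (min as addition, + as multiplication) is chosen simply because it matches the "costs used to rank alternatives" reading of weights.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Semiring:
    """A semiring given by its two operations and their identity elements."""
    plus: Callable[[float, float], float]   # combines alternative paths
    times: Callable[[float, float], float]  # combines weights along one path
    zero: float                             # identity for plus
    one: float                              # identity for times


# Tropical semiring: weights behave like costs, and the cheapest path wins.
TROPICAL = Semiring(plus=min, times=lambda a, b: a + b,
                    zero=float("inf"), one=0.0)


@dataclass(frozen=True)
class Transition:
    src: int
    ilabel: str    # input label
    olabel: str    # output label, possibly from a different alphabet
    weight: float  # element of the semiring
    dst: int


@dataclass
class WeightedTransducer:
    start: int
    finals: Dict[int, float]  # final state -> final weight
    transitions: List[Transition]

    def transduce(self, inputs: Sequence[str],
                  semiring: Semiring) -> Dict[Tuple[str, ...], float]:
        """Map each output sequence reachable on `inputs` to its weight.

        Assumes no epsilon transitions: every transition consumes exactly
        one input symbol and emits exactly one output symbol.
        """
        # frontier maps (state, outputs so far) to the accumulated weight
        frontier = {(self.start, ()): semiring.one}
        for symbol in inputs:
            following: Dict[Tuple[int, Tuple[str, ...]], float] = {}
            for (state, out), w in frontier.items():
                for t in self.transitions:
                    if t.src == state and t.ilabel == symbol:
                        key = (t.dst, out + (t.olabel,))
                        extended = semiring.times(w, t.weight)
                        following[key] = semiring.plus(
                            following.get(key, semiring.zero), extended)
            frontier = following

        results: Dict[Tuple[str, ...], float] = {}
        for (state, out), w in frontier.items():
            if state in self.finals:
                total = semiring.times(w, self.finals[state])
                results[out] = semiring.plus(
                    results.get(out, semiring.zero), total)
        return results


# Toy machine: input "a" may be rewritten as "x" (cost 0.5) or "y" (cost 1.5),
# and input "b" is rewritten as "z" (cost 0.25).
toy = WeightedTransducer(
    start=0,
    finals={2: 0.0},
    transitions=[
        Transition(0, "a", "x", 0.5, 1),
        Transition(0, "a", "y", 1.5, 1),
        Transition(1, "b", "z", 0.25, 2),
    ],
)

ranked = toy.transduce(["a", "b"], TROPICAL)
# ranked == {("x", "z"): 0.75, ("y", "z"): 1.75}
```

Because the operations are passed in through the Semiring object, the same traversal could in principle be reused with other weight sets (for instance ordinary + and × over probabilities) without changing the transducer code, which is the practical point of treating weights as semiring elements.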

Chapter 4 - Statistical Natural Language Processing
