To save content items to your account,
please confirm that you agree to abide by our usage policies.
If this is the first time you use this feature, you will be asked to authorise Cambridge Core to connect with your account.
Find out more about saving content to .
To save content items to your Kindle, first ensure email@example.com
is added to your Approved Personal Document E-mail List under your Personal Document Settings
on the Manage Your Content and Devices page of your Amazon account. Then enter the ‘name’ part
of your Kindle email address below.
Find out more about saving to your Kindle.
Note you can select to save to either the @free.kindle.com or @kindle.com variations.
‘@free.kindle.com’ emails are free but can only be saved to your device when it is connected to wi-fi.
‘@kindle.com’ emails can be delivered even when you are not connected to wi-fi, but note that service fees apply.
As a starting point we study finite-state automata, which represent the simplest devices for recognizing languages. The theory of finite-state automata has been described in numerous textbooks both from a computational and an algebraic point of view. Here we immediately look at the more general concept of a monoidal finite-state automaton, and the focus of this chapter is general constructions and results for finite-state automata over arbitrary monoids and monoidal languages. Refined pictures for the special (and more standard) cases where we only consider free monoids or Cartesian products of monoids will be given later.
The aim of this chapter is twofold. First, we recall a collection of basic mathematical notions that are needed for the discussions of the following chapters. Second, we have a first, still purely mathematical, look at the central topics of the book: languages, relations and functions between strings, as well as important operations on languages, relations and functions. We also introduce monoids, a class of algebraic structures that gives an abstract view on strings, languages, and relations.
Classical finite-state automata represent the most important class of monoidal finite-state automata. Since the underlying monoid is free, this class of automaton has several interesting specific features. We show that each classical finite-state automaton can be converted to an equivalent classical finite-state automaton where the transition relation is a function. This form of ‘deterministic’ automaton offers a very efficient recognition mechanism since each input word is consumed on at most one path. The fact that each classical finite-state automaton can be converted to a deterministic automaton can be used to show that the class of languages that can be recognized by a classical finite-state automaton is closed under intersections, complements, and set differences. The characterization of regular languages and deterministic finite-state automata in terms of the ‘Myhill–Nerode equivalence relation’ to be introduced in the chapter offers an algebraic view on these notions and leads to the concept of minimal deterministic automata.
A fundamental task in natural language processing is the efficient representation of lexica. From a computational viewpoint, lexica need to be represented in a way directly supporting fast access to entries, and minimizing space requirements. A standard method is to represent lexica as minimal deterministic (classical) finite-state automata. To reach such a representation it is of course possible to first build the trie of the lexicon and then to minimize this automaton afterwards. However, in general the intermediate trie is much larger than the resulting minimal automaton. Hence a much better strategy is to use a specialized algorithm to directly compute the minimal deterministic automaton in an incremental way. In this chapter we describe such a procedure.
This chapter describes a special construction based on finite-state automata with important applications: the Aho–Corasick algorithm is used to efficiently find all occurrences of a finite set of strings (also called pattern set, or dictionary) in a given input string, called the ‘text’. Search is ‘online’, which means that the input text is neither fixed nor preprocessed in any way. This problem is a special instance of pattern matching in strings, and other automata constructions are used to solve other pattern matching tasks. From an automaton point of view, the Aho–Corasick algorithm comes in two variants. We first present the more efficient version where a classical deterministic finite-state automaton is built for text search. The disadvantage of this first construction is that the resulting automaton can become very large, in particular for large pattern alphabets. Afterwards we present the second version, where an automaton with additional transitions of a particular kind is built, yielding a much smaller device for text search.
A common task arising in many contexts is rewriting parts of a given input string to another form. Subparts of the input that match specific conditions are replaced by other output parts. In this way, the complete input string is translated to a new output form. Due to the importance of text rewriting, many programming languages offer matching/rewriting operations for subexpressions of strings, also called replace rules. When using strictly regular relations and functions for representing replace rules, a cascade of replace rules can be composed into a single transducer. If the transducer is functional, an equivalent bimachine or (in some cases) a subsequential transducer can be built, thus achieving theoretically and practically optimal text processing speed. In this chapter we introduce basic constructions for building text rewriting transducers and bimachines from replace rules and provide implementations. A first simple version in general leads to an ambiguous form of text rewriting with several outputs. A second more sophisticated construction solves conflicts using the leftmost-longest match strategy and leads to functional devices.
An important generalization of classical finite-state automata are multi-tape automata, which are used for recognizing relations of a particular type. The so-called regular relations (also refered to as ‘rational relations’) offer a natural way to formalize all kinds of translations and transformations, which makes multi-tape automata interesting for many practical applications and explains the general interest in this kind of device. A natural subclass are monoidal finite-state transducers, which can be defined as two-tape automata where the first tape reads strings. In this chapter we present the most important properties of monoidal multi-tape automata in general and monoidal finite-state transducers in particular. We show that the class of relations recognized by n-tape automata is closed under a number of useful relational operations like composition, Cartesian product, projection, inverse etc. We further present a procedure for deciding the functionality of classical finite-state transducers.
In this chapter we introduce the C(M) language, a new programming language. C(M) statements and expressions closely resemble the notation commonly used for the presentation of formal constructions in a Tarskian style set theoretical language. The usual set theoretic objects such as sets, functions, relations, tuples etc. are naturally integrated in the language. In contrast to imperative languages such as C or Java, C(M) is a functional declarative programming language. C(M) has many similarities with Haskell but makes use of the standard mathematical notation like SETL. The C(M) compiler translates a well-formed C(M) program into efficient C code, which can be executed after compilation. Since it is easy to read C(M) programs, a pseudo-code description becomes obsolete.
In this chapter we introduce the bimachine, a deterministic finite-state device that exactly represents the class of all regular string functions. We prove this correspondence, using as a key ingredient a procedure for converting transducers to bimachines. Methods for pseudo-minimization and direct composition of bimachines are added.
In this chapter we present C(M) implementations of the main automata constructions. Our aim is to provide full descriptions of the implementations that are clear and easy to follow. In some cases the simplicity of the implementation is achieved at the expense of some inefficiency.
In this chapter we explore deterministic finite-state transducers. Obviously, it only makes sense to ask for determinism if we restrict attention to transducers with a functional input-output behaviour. In this chapter we focus on transducers that are deterministic on the input tape (called sequential or subsquential transducers). We shall see that only a proper subset of all regular string functions can be represented by this kind of device and we describe a decision procedure for testing whether a functional transducer can be determinized. Further we present a subsequential transducer minimization procedure based on the Myhill–Nerode relation for string functions.
Finite-state methods are the most efficient mechanisms for analysing textual and symbolic data, providing elegant solutions for an immense number of practical problems in computational linguistics and computer science. This book for graduate students and researchers gives a complete coverage of the field, starting from a conceptual introduction and building to advanced topics and applications. The central finite-state technologies are introduced with mathematical rigour, ranging from simple finite-state automata to transducers and bimachines as 'input-output' devices. Special attention is given to the rich possibilities of simplifying, transforming and combining finite-state devices. All algorithms presented are accompanied by full correctness proofs and executable source code in a new programming language, C(M), which focuses on transparency of steps and simplicity of code. Thus, by enabling readers to obtain a deep formal understanding of the subject and to put finite-state methods to real use, this book closes the gap between theory and practice.
Email your librarian or administrator to recommend adding this to your organisation's collection.