1 Introduction
At first glance, combining the vast and diverse areas of quantitative and computational phonology in a single survey seems like a daunting task.Footnote 1 But even a little reflection reveals the necessity of this combination, as establishing an agreed-upon dividing line between these areas is neither possible nor particularly useful. For example, what makes an approach to phonology a “computational” one? Perhaps it means the use of software to implement and/or test a particular model, analysis, or learning algorithm. But this definition omits or at least misconstrues work grounded in a formal theory of the nature of computation that maintains a distinction between algorithm and code. Taking this idea even further, we might consider all work on phonology to be computational in nature, given that the phonology itself is a computational system that solves a variety of problems such as recognition (parsing an overt form), generation (mapping an underlying form to a surface form), and membership (assessing well-formedness). Furthermore, since a phonological grammar is the end target of acquisition, any work that theorizes or characterizes that grammar relates to some version of the phonological learning problem.
The term “quantitative” likewise invokes multiple associations, from the gradient nature of phonological patterns themselves, to the stochastic or nondeterministic algorithms used to model them, to the statistical methods that compare the resulting models to each other. In practice, research grounded in all of these ideas is of course aided by computational tools. The strategy behind this Element, then, is to instead embrace this entanglement and to celebrate the innovation and enthusiasm of scholars who have applied algorithmic, formal, mathematical, statistical, and/or probabilistic methods to the study of phonology and the computational problems it solves. This goal of course provides an enormous amount of ground to cover, and so an additional inclusion criterion is a shared assumption of an abstract phonological grammar in the generative tradition (broadly construed) that is distinct from – though not necessarily independent of – the phonetics. As a result, certain lines of work that are unequivocally computational and/or quantitative are regrettably being left out in the interest of space and narrative cohesion. These include (among others) the learning of phonetic categories (e.g., Dillon et al. 2013; Thorburn et al. 2022; Matusevych et al. 2023), the modeling of lexical acquisition via phonetic variation (e.g., Elsner et al. 2012, 2013), and usage-based and exemplar models of phonology (e.g., Bybee 2001, 2007; Pierrehumbert 2001a, 2001b).
For the areas that will be covered, the objective is to be representative rather than exhaustive, in order to demonstrate the myriad ways that quantitative and computational approaches have been applied to a variety of questions grounded in a variety of theoretical perspectives. To that end, the organizational structure is a combination of methods, theories, and problems of interest. Specifically, the outline of the Element is as follows. Section 2 discusses computational work on rule-based phonological grammars and how they are learned. Section 3 then reviews work on constraint-based phonology, with a focus on computational and probabilistic models for the learning of grammars and hidden structure.
Following these discussions of work grounded in phonological grammars of particular types, the next three sections show how these traditional core areas of research have been extended to address additional areas of interest through a variety of methods. Section 4 turns to a central area where quantitative methods have been employed: gradient acceptability and the contribution of lexical statistics to phonotactic generalizations. Section 5 briefly surveys the application of information theoretic methods to questions about phonological structure, and then Section 6 likewise briefly surveys the past, current, and potential future applications of connectionist models to the study of phonology. Stepping back, Section 7 discusses the contributions that formal language and model theoretic approaches to phonology can and have made to the study of phonological typology, learning, and representations. Rather than focusing on particular types of grammars, such work draws distinctions based on phonological patterns themselves, making its findings relevant to any theory of the phonological grammar and how it is learned. Finally, Section 8 concludes.
Importantly, these sectional groupings are not intended as a partition, as the boundary lines (both conceptual and methodological) are often blurry. The broader goal is instead to highlight how these categories complement each other and have great potential for further integration. These problems are hard. Studying them from as many angles as possible can only lead to greater collective progress on the many fascinating open questions about the phonological component of human languages.
2 Rule-Based Phonology
We begin with a discussion of formal approaches to rule-based phonology, a term most often defined in contrast to constraint-based phonology, which will be discussed in Section 3. The distinction between these categories is not absolute, as rule-based theories can and have made use of constraints (e.g., morpheme structure constraints, Halle 1959, Stanley 1967, or constraints that block or trigger a rule’s application; see Reiss 2008 for examples and discussion). A set of rules can also function like constraints on input–output mappings, as in Koskenniemi’s (1983) theory of two-level rules. Furthermore, the term rule is sometimes used to refer to a context-dependent pattern of alternation between segments. To “learn a phonological rule” can mean to learn the fact that, for example, {t, d} in American English are flapped between two vowels, or {p, t, k} are aspirated as simple onsets. The use of “rule” here is just a shorthand for a type of pattern and is not necessarily tied to any particular assumption about the form of the phonological grammar.
The scope of this section includes work that does make such an assumption, namely that the phonological grammar consists of a set of ordered context-sensitive rules. Two main threads of research will be discussed. The first is work on formalizing such grammars, including specifying the algorithm by which a rule applies to a string. The second is work on the learning of these grammars.
Arguably, the first theory that comes to mind in the context of rule-based phonology is the one proposed in Chomsky and Halle’s (1968) Sound Pattern of English (hereafter SPE), in which phonological rules take the form in (1):
(1) A → B / X __ Y
This rule asserts that A is rewritten as B when in the context X__Y (i.e., the string XAY is rewritten XBY). However, the extension of such a rule (i.e., the set of input–output string pairs it represents) is ambiguous without a specification for how the rule applies to a string. This ambiguity is most apparent with rules and strings that have the potential for multiple applications. Consider the rule in (2a) and the string in (2b); depending on how the rule applies, the output for this string could be [par], [pa], etcetera.
(2)
a. [+cons] → ∅ / __ #
b. /park/
The rule-application algorithm specified in SPE has become known as simultaneous application and is stated as follows: “To apply a rule, the entire string is first scanned for segments that satisfy the environmental constraints of the rule. After all such segments have been identified in the string, the changes required by the rule are applied simultaneously” (Chomsky and Halle 1968: 344). Assuming this algorithm, the rule in (2a) maps /park/ to [par], since in the underlying representation (UR) only /k/ satisfies the rule’s structural description ([+cons]#).
Johnson (1972) provides a formal argument for why simultaneous rules are preferable to iterative ones, where an iterative rule is defined as one that applies repeatedly to a string as long as any targets remain. Specifically, he proves that iterative rules can simulate any arbitrary string rewriting system, while simultaneous ones are limited to the power of finite-state transducers (FSTs). He then goes on to develop a theory of linear rules, which – like iterative rules – can apply multiple times to a string, but – like simultaneous rules – are limited to finite-state expressivity. Their restrictiveness comes from a requirement that each successive application moves further into the string, in contrast to iterative rules for which there is no imposed order on the different applications. In other words, linear rules are directional and are specified to apply either left to right or right to left.
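The contrast between simultaneous and iterative application can be sketched in a few lines of Python. This is a minimal illustration rather than an implementation of SPE or of Johnson's formalism: it assumes that the rule deletes a word-final consonant (consistent with the structural description [+cons]# cited above), it uses a toy consonant inventory, and the function names are hypothetical.

```python
CONSONANTS = set("ptkbdgrlmns")  # toy [+cons] inventory (assumption)

def delete_final_simultaneous(form: str) -> str:
    """Identify targets in the input only, then apply all changes at once."""
    # Only the last segment of the input is word-final, so at most one deletion.
    if form and form[-1] in CONSONANTS:
        return form[:-1]
    return form

def delete_final_iterative(form: str) -> str:
    """Reapply the rule to its own output for as long as a target remains."""
    while form and form[-1] in CONSONANTS:
        form = form[:-1]
    return form

print(delete_final_simultaneous("park"))  # par
print(delete_final_iterative("park"))     # pa
```

Under these assumptions the iterative version keeps reapplying because each deletion exposes a new word-final consonant, which is exactly the kind of self-feeding that simultaneous application excludes.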
Despite their formal equivalence, linear rules are argued to be preferable to simultaneous ones on account of the greater simplicity with which they can capture multiple applications. Simultaneous rules do allow for multiple applications through the mechanism of parenthesis-star, by which a rule becomes an abbreviation for an infinite set of rules that contain zero or more tokens of the expression in parentheses. Consider the example ATR harmony rule in (3a). Simultaneous application will only identify the first /ʊ/ as a target in (3b), with the resulting output [putukʊ].
a.
b. /putʊkʊ/
If the actual surface form reflects an additional application (i.e., [putuku]), the rule can be instead formulated as in (4a). This version represents an infinite set of rules, including (3a) as well as (4b) and (4c), that allows for zero or more [−ATR] vowels to intervene between the trigger and target. As long as a vowel satisfies one of the rules in this infinite expansion in the UR, it will harmonize to [+ATR] under simultaneous application (i.e., the second /ʊ/ in (3b) satisfies rule (4b)).
a.
b.
c.
As a linear rule that applies left to right, however, (3a) can achieve the same effect without the added notation of parenthesis-star. After the first application generates [putukʊ], the newly created [u] can serve as the trigger for the final vowel. Johnson’s work thus provided a formal grounding from which to compare theories of rules and rule application in terms of both descriptive adequacy and elegance.Footnote 2
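The difference between simultaneous and left-to-right linear application of a harmony rule can be illustrated with a short sketch. It assumes a simplified rule (a [−ATR] vowel becomes [+ATR] when the nearest vowel to its left is [+ATR]), a toy vowel inventory, and the input /putʊkʊ/ inferred from the outputs cited above; none of these details should be read as the exact formulation of the rule in (3a).

```python
ATR_PAIRS = {"ɪ": "i", "ʊ": "u", "ɛ": "e", "ɔ": "o"}  # [-ATR] -> [+ATR], toy inventory
PLUS_ATR = set(ATR_PAIRS.values())
VOWELS = PLUS_ATR | set(ATR_PAIRS)

def harmonize(form: str, linear: bool) -> str:
    """Raise a [-ATR] vowel to [+ATR] if the nearest vowel to its left is [+ATR].

    linear=False: triggers are read off the input (simultaneous application).
    linear=True:  triggers are read off the output built so far (left to right).
    """
    out = []
    prev_vowel = None  # nearest preceding vowel, from input or output by mode
    for seg in form:
        new = seg
        if seg in ATR_PAIRS and prev_vowel in PLUS_ATR:
            new = ATR_PAIRS[seg]
        out.append(new)
        if seg in VOWELS:
            prev_vowel = new if linear else seg
    return "".join(out)

print(harmonize("putʊkʊ", linear=False))  # putukʊ
print(harmonize("putʊkʊ", linear=True))   # putuku
```

The only difference between the two modes is whether a newly created [+ATR] vowel is visible as a trigger, which is the property that lets the linear rule dispense with parenthesis-star.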
We turn now to the question of learning rules and rule-based grammars. From the perspective of formal learnability, such grammars do not have many inherent advantages. Context-sensitive grammars like SPE are not learnable from positive data; indeed (as will be discussed further in Section 7), not even regular grammars are, despite being situated well below context-sensitive on the Chomsky–Schützenberger hierarchy (Chomsky 1959; Chomsky and Schützenberger 1959):
(5) Finite ⊂ Regular ⊂ Context-free ⊂ Context-sensitive ⊂ Recursively enumerable
Early work on rule learning then focused on the question of how a learner can identify the correct rule or grammar when more than one are descriptively adequate. Johnson (1984) demonstrates a deductive approach to learning a limited set of phonological rules of the form in (6), where “a” and “b” are segments and “C” is a feature matrix of unspecified length that is a subset of the segments surrounding “a.”
(6) a → b / C
The input data is a set of paradigms with stems inflected with various affixes. Through inspection of this data, the learner identifies the contexts in which “a” alternates with “b” without including those tokens of “a” that do not alternate. In the case of multiple ordered rules, it entertains all possible hypotheses for which alternation took place last, undoes that discovered rule, and then repeats this procedure until all alternations have been accounted for. The procedure is error driven in the sense that it rejects a hypothesis when it arrives at a point where the rule discovery procedure fails. This working-backward technique means the learner will also propose URs for the stems and affixes present in the data.
Johnson’s approach demonstrates that the presence of non-surface-true patterns resulting from rule ordering is not inherently a barrier to learning, but additional selection criteria are needed for such a learner to converge on a single grammar. In a case in which the rules are not strictly ordered (i.e., multiple orderings will correctly generate the data), the learner will identify multiple grammars with no way of deciding among them. Similarly, given that the context of a single rule will not always be strongly data determined, the learner will need to appeal to some type of evaluation metric or guiding principle to select among a set of adequate contexts.
Other work that addresses the learnability of rules has been couched in a substance-free theory that emphasizes the formal nature of phonology as a computational system that operates over symbolic representations. Reiss (2008: 258–9) – summarizing views presented in earlier work including Hale and Reiss (2000a, 2000b) – articulates the substance-free view of the phonological system as follows:
The computational system treats features as arbitrary symbols. What this means is that many of the so‐called phonological universals (often discussed under the rubric of markedness) are in fact epiphenomena deriving from the interaction of extragrammatical factors like acoustic salience and the nature of language change. Phonology is not and should not be grounded in phonetics since the facts which phonetic grounding is meant to explain can be derived without reference to phonology.Footnote 3
This focus on phonological computation enables a streamlined conception of universal grammar (UG) that facilitates formal discussions and treatments of learning problems. For example, Dell (1981) discusses the subset problem inherent to the learning of optional rules. As an example, French optionally deletes an /l/ that follows an obstruent and precedes a consonant or pause:
a. , “which table?”
b. [parl], parle, “speak”
Consider two versions of this deletion rule: Deletion A targets coda /l/’s following obstruents, and Deletion B targets coda /l/’s following any consonant. The evidence that distinguishes these two rules is negative evidence, such as the ungrammaticality of *[par] in (7b). But given that such evidence is unavailable to the French-learning child, how do they come to select Deletion A instead of Deletion B? This is the subset problem. The grammar that includes the less restrictive Deletion B will correctly generate all of the observed data. Without negative evidence, the learner will never have reason to consider a more restrictive grammar (i.e., one that generates a proper subset of the forms generated by its current grammar). Dell then proposes that the language acquisition device (LAD) must include the strategy of always selecting the more restrictive grammar when faced with this choice.
Notably, this scenario is only relevant to the case of optional rules. As an optional rule, Deletion A generates the language including [tabl], [tab], and [parl], while the less restrictive Deletion B generates [tabl], [tab], [parl], and *[par]. But if the rules are obligatory, the subset–superset relationship no longer holds: Deletion A generates [tab] and [parl] while Deletion B generates [tab] and *[par]. In this case, positive data – specifically, encountering [parl] – will be sufficient to correct an earlier hypothesis of Deletion B. In this way, obligatory rules serve to provide indirect negative evidence (i.e., if [parl] is grammatical, then *[par] must not be). The challenge, of course, is how the learner can know whether the rule they are learning is optional or obligatory. To address this issue, Dell further proposes that the learner assumes the rule is obligatory until they encounter evidence of optionality.
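The subset relationships described here can be checked mechanically. The following sketch is a simplified illustration under stated assumptions: only word-final /l/ is considered, the segment classes are toy inventories, and the function names are hypothetical.

```python
OBSTRUENTS = set("ptkbdgfvsz")           # toy inventory (assumption)
CONSONANTS = OBSTRUENTS | set("rlmn")    # toy inventory (assumption)

def outputs(ur: str, trigger: set, optional: bool) -> set:
    """Surface forms of `ur` under deletion of final /l/ after a `trigger` segment."""
    applies = len(ur) >= 2 and ur[-1] == "l" and ur[-2] in trigger
    if not applies:
        return {ur}
    deleted = ur[:-1]
    return {ur, deleted} if optional else {deleted}

def language(urs, trigger, optional):
    """Union of the surface forms generated for every UR."""
    return set().union(*(outputs(u, trigger, optional) for u in urs))

URS = ["tabl", "parl"]
optional_a = language(URS, OBSTRUENTS, optional=True)     # Deletion A
optional_b = language(URS, CONSONANTS, optional=True)     # Deletion B
obligatory_a = language(URS, OBSTRUENTS, optional=False)
obligatory_b = language(URS, CONSONANTS, optional=False)

print(optional_a < optional_b)        # True: proper subset, so no positive counterevidence
print(obligatory_a <= obligatory_b)   # False: [parl] contradicts Deletion B
```

With optional rules every form generated by Deletion A is also generated by Deletion B, whereas with obligatory rules the attested [parl] falls outside the language of Deletion B and can therefore serve as a correction.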
Hale and Reiss (2008) recast the subset principle as a description of the learner’s initial state, rather than a guiding principle for selecting among competing grammars. In particular, they argue that UG must provide the complete set of primitives (i.e., features) so that the learner can posit the most specified rule possible (contra typically assumed pressures of economy in rule formulations) that accounts for all and only the forms to which it applies. Generalizing over observed instances of the rule is achieved via set intersection over the fully specified feature bundles. Maximally specifying while still generalizing is necessary for the reasons stated earlier in this Element: an overly general rule will never be contradicted by positive evidence alone.
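Generalization by set intersection can be sketched as follows. The feature specifications are hypothetical and deliberately small (a real system would start from fully specified bundles), and the function name is an assumption made for illustration.

```python
def generalize(*bundles: dict) -> dict:
    """Keep only the feature-value pairs shared by every observed bundle."""
    shared = set(bundles[0].items())
    for bundle in bundles[1:]:
        shared &= set(bundle.items())
    return dict(shared)

# Hypothetical feature bundles for three observed rule targets
t = {"cons": "+", "son": "-", "voice": "-", "cor": "+", "cont": "-"}
d = {"cons": "+", "son": "-", "voice": "+", "cor": "+", "cont": "-"}
s = {"cons": "+", "son": "-", "voice": "-", "cor": "+", "cont": "+"}

print(generalize(t, d))     # [voice] is dropped; all other values are retained
print(generalize(t, d, s))  # [voice] and [cont] are dropped
```

Because each new observation can only remove feature values, the hypothesized class grows monotonically and remains the most specific description compatible with the data seen so far.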
This principle of generalizing over more specified instances was employed in the minimal generalization learner of Albright and Hayes (2002, 2003), who investigate rule learning in the context of the rules-versus-analogy debate of inflectional morphology (e.g., Pinker and Prince 1994; Bybee 2001).Footnote 4 The learner compares pairs of (present, past) forms to identify the change that derives the past from the present. For example, the comparison of shine and shined reveals the rule in (8):
(8)
This rule is, of course, overly specific, but additional comparisons reveal more rules that share its structural change (e.g., grab-grabbed, hug-hugged, fill-filled, etc.). The specifications of these rules can be combined into the more general rule in (9a), or using features, (9b).
a.
b.
This procedure will generate a set of rules that differ in their generality, such that a given input may be subject to more than one rule. To address this ambiguity, all rules are also given a confidence score, which is the number of forms in the training data that the rule correctly applies to (i.e., its number of hits) divided by the number of forms it can apply to (i.e., the rule’s scope). The set of rules along with their confidence scores can account for gradience (see Section 4) if an output’s well-formedness is taken as the score of the “best” rule that generates it.
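The confidence score as described here (hits divided by scope) can be computed directly. The sketch below uses a hypothetical five-item training set in rough phonemic transcription and a hypothetical rule that suffixes [d] after a voiced consonant; it implements only the raw ratio stated in the text.

```python
from fractions import Fraction

def confidence(applies, derive, data):
    """Hits / scope for one rule over (present, past) training pairs."""
    scope = [(pres, past) for pres, past in data if applies(pres)]
    hits = [(pres, past) for pres, past in scope if derive(pres) == past]
    return Fraction(len(hits), len(scope)) if scope else Fraction(0)

# Hypothetical training pairs (present, past)
DATA = [("græb", "græbd"), ("hʌg", "hʌgd"), ("fɪl", "fɪld"),
        ("ʃaɪn", "ʃaɪnd"), ("sɪŋ", "sæŋ")]

VOICED_C = set("bdgmnŋlr")  # toy class of voiced consonants (assumption)

def applies(pres):
    return pres[-1] in VOICED_C   # structural description met

def derive(pres):
    return pres + "d"             # structural change: suffix [d]

print(confidence(applies, derive, DATA))  # 4/5
```

The irregular pair falls within the rule's scope but is not a hit, which lowers the score and illustrates how a more general rule can end up less reliable than a narrower one.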
Getting back to the question of the learner’s initial state, Hale and Reiss’s (2008) argument is that because the distinctions needed to work out the phonological system can only be detected to the extent that parsing encodes forms differently, the maximally specified initial representation needed to avoid the subset problem is only possible if UG provides all features. But Odden (2022) rejects this assumption in favor of radical substance-free phonology, in which only the abstract concept of a feature is provided by UG rather than an actual feature set. The learning of both the feature set and the rules of the grammar is guided by the evidence of how and which sounds pattern together (see also Mielke 2008).Footnote 5 Even though the phonological system is responsible for parsing those sounds that are recognized as linguistic objects, the auditory system deals with acoustic representations of all sounds. If the current hypothesis for the phonological grammar has discarded important information about, for example, which sounds contrast, that information is still available through the auditory system and so a correction can be made. Given that, the learner’s objective can be characterized by something other than avoiding the subset problem. Odden’s (2022: 526) proposal instead emphasizes simplicity: “The task of feature acquisition is finding the simplest system of properties that accounts for those cases of grammatical functioning-together that can be observed in the primary linguistic data.”
The significance of both restrictiveness and simplicity is recognized by Rasin et al. (2020, 2021), who argue that the learner’s task is actually to find the optimal balance between these potentially conflicting demands. Specifically, they propose the use of minimum description length (MDL) for learning not just a grammar of ordered rewrite rules but also the lexicon of URs.Footnote 6 Length in this case is the combined total (in bits) of the grammar itself as well as the encoding of the data given that grammar. Consider again the example of optional /l/ deletion in French. The context-free version of this rule in (10a) is shorter than the target version in (10b), and so would be favored by simplicity.
a.
b.
Of course (10a) will over-generate, but as discussed earlier in this section, the learner cannot recover from that error without negative evidence. Employing something like the subset principle is necessary to select the correct rule.
However, Rasin et al. argue that the optionality of the rule creates a further problem. If the learner has only encountered one of the possible rule outputs – say either [tab] or [tabl] for “table” – then a grammar that does not generalize to predict the other form will be more restrictive than the one that includes (10b). One such grammar would be the one that does not propose a rule at all, but instead just lists in the lexicon all of the forms observed so far. In this way, the subset principle’s insistence on restrictiveness can have the effect of preventing generalization.
Minimum description length balances these competing demands by computing not just the size of the grammar, but the “cost” of describing the data with that grammar. In particular, the cost of choosing a UR from the lexicon increases with the size of the lexicon (2 choices = 1 bit, 4 choices = 2 bits, etc.), and so generalizing with a rule is cheaper. In addition, whether or not an optional rule has applied to a UR that it can apply to is specified with an additional bit. (This added bit is not needed for obligatory rules, because whether or not they apply is determined by the choice of UR itself.) As an optional rule, then, (10b) is actually cheaper than (10a) because it can apply to fewer URs.
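The data-encoding half of this calculation can be illustrated with a small sketch. It is a simplification under stated assumptions: the four URs are hypothetical, the two rule conditions are rough stand-ins for a context-free and a context-restricted deletion rule, and the length of the grammar itself (the other half of the MDL total) is not computed.

```python
import math

def data_cost(tokens, lexicon, rule_can_apply):
    """Bits needed to encode `tokens` given a lexicon and one optional rule."""
    choose_ur = math.log2(len(lexicon))  # 2 choices = 1 bit, 4 choices = 2 bits
    # One extra bit per token records whether the optional rule applied,
    # but only for URs that the rule can apply to.
    return sum(choose_ur + (1 if rule_can_apply(ur) else 0) for ur in tokens)

LEXICON = ["tabl", "parl", "sabl", "bal"]   # hypothetical URs
TOKENS = list(LEXICON)                      # one observed token per UR
OBSTRUENTS = set("ptkbdgfvsz")              # toy inventory (assumption)

def context_free(ur):
    return "l" in ur                        # deletion possible for any /l/

def restricted(ur):
    return ur.endswith("l") and ur[-2] in OBSTRUENTS

print(data_cost(TOKENS, LEXICON, context_free))  # 12.0
print(data_cost(TOKENS, LEXICON, restricted))    # 10.0
```

The restricted rule saves one bit for every UR that falls outside its structural description, and it is this saving on the data side that can outweigh the rule's greater length on the grammar side.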
As the recency of Rasin et al.’s (2020, 2021) work shows, the question of how to learn rule-based grammars has not been abandoned, but it is undeniable that the progress in phonological learning research was greatly accelerated following the shift to constraint-based grammars like optimality theory (OT) (Prince and Smolensky 1993, 2004). The next section turns to the research on these grammar types, the comparisons among them, and how to learn them.
3 Constraint-Based Phonology
Constraint-based phonology has become an umbrella term for a collection of phonological theories that centralize constraints on representations instead of the procedures by which those representations are changed.Footnote 7 Given how readily these theories lend themselves to computational learning models and statistical methods for working with quantitative data and patterns, it is not surprising that the shift in the field from rule- to constraint-based phonology corresponded to a surge in research on these areas that continues today. This section will begin with an overview of different types of constraint-based grammars (Section 3.1), followed by a survey of the work addressing a variety of phonological learning problems (Section 3.2). Lastly, Section 3.3 will highlight some of the arguments used to compare these theories to each other, including their ability to address questions of theoretical interest as well as what is known about their respective complexity and learnability.
3.1 Constraint-Based Theories of Phonology
The theory of declarative phonology (Bird et al. 1992; Bird 1995) forgoes derivation in favor of inviolable (i.e., “hard”) constraints whose interaction is compositional: all constraints must be satisfied simultaneously for a phonological object to be licit. The term declarative invokes the distinction between declarative and imperative programming languages, the former emphasizing the logic of the computation over the step-by-step procedure that performs it. Logic also provides a formal description language for the constraints themselves, which can be stated using predicates and logical connectives. For example, the constraint in (11) asserts that every onset is dominated by (∂) a syllable (example from Scobbie et al. 1996):
(11)
What we might call a rule can also be represented with a constraint in the same description language – for example, (12), which says a high back vowel is specified as [+round]. Further conditions can be added to the antecedent to constrain this specification to certain contexts.
(12)
Lastly, morphemes or lexical entries are also stated as constraints (or partial descriptions), in contrast to more typical generative assumptions that differentiate the lexicon from the grammar that operates on it. For example, (13) gives a partial description of a vowel:
(13)
Declarative phonology has been applied to the study of many types of phonological structure; for examples see Broe (1993), Bird and Ellison (1994), and Scobbie et al. (1996), and references therein.
Inviolable constraints are language-particular and at times need to be quite specific in order to capture the context-dependent nature of phonological alternations and feature licensing. In contrast, violable constraints can be more general and amenable to arguments of a universal constraint set. However, the potential for violation necessitates a different mechanism than full satisfaction to determine well-formedness. One such mechanism is to rank the constraints in order of importance (or severity of violation), which is the foundation of OT.
As a basic example, the grammar in Table 1 shows how a word-final devoicing pattern can be represented in OT. Columns after the first one are labeled with members of the universal constraint set CON, and rows after the first one are labeled with one of the infinite set of candidate surface forms provided by GEN. Cells are filled in by EVAL, which assesses whether and to what extent each candidate violates each constraint.
UR: /bad/ | DEP | MAX | *D# | Ident(voice) |
---|---|---|---|---|
[bad] | | | *! | |
☞[bat] | | | | * |
[ba] | | *! | | |
[bada] | *! | | | |
The faithful candidate [bad], which is identical to the UR /bad/, violates the markedness constraint *D#, which says words cannot end in voiced obstruents. The winning candidate (marked with ☞) also violates a constraint, the faithfulness constraint Ident(voice), which says that surface specifications of the feature voice should match their underlying values. But because the constraints are ranked in order of importance (with left-to-right corresponding to more-to-less important), the violation of Ident(voice) is less serious than the violation of *D#, and so [bat] is the optimal form among those evaluated. Additional faithfulness constraints that are violated by other means of avoiding the violation of *D# – including deleting the obstruent (MAX = don’t delete) or adding a word-final vowel (DEP = don’t add) – must also be ranked above Ident(voice) so as to ensure the candidate that violates it is in fact the winner.
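Evaluation under a strict ranking amounts to a lexicographic comparison of violation profiles, which can be sketched as follows. The violation counts follow the description above; the finite candidate set and the function name are assumptions made for illustration, since GEN in principle supplies an infinite set.

```python
# Violations per candidate for the UR /bad/, as described in the text
VIOLATIONS = {
    "bad":  {"*D#": 1},
    "bat":  {"Ident(voice)": 1},
    "ba":   {"MAX": 1},
    "bada": {"DEP": 1},
}

def eval_ot(ranking, violations):
    """Return the candidate with the lexicographically smallest violation profile."""
    def profile(cand):
        # Violation counts ordered from highest- to lowest-ranked constraint
        return tuple(violations[cand].get(c, 0) for c in ranking)
    return min(violations, key=profile)

print(eval_ot(["DEP", "MAX", "*D#", "Ident(voice)"], VIOLATIONS))  # bat
print(eval_ot(["DEP", "MAX", "Ident(voice)", "*D#"], VIOLATIONS))  # bad
```

Reversing the two lowest-ranked constraints yields the faithful candidate instead, which corresponds to a grammar without final devoicing.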
“Classic” OT grammars are categorical, mapping an input to its single, optimal output. But there is great interest in models of grammar that allow for multiple outputs due to variation, as well as account for the observed gradience in acceptability judgments and/or lexical statistics (as will be discussed at length in the next section). Proposals for modeling gradience with a grammar of ranked constraints have included (1) stochastic OT, (2) partially ordered constraints, and (3) the rank-ordered model of EVAL. These will now be discussed in turn.Footnote 8
In stochastic OT (Boersma 1997; Boersma and Hayes 2001), constraints are associated with a range of values rather than a fixed position. During evaluation, a random variable introduces noise that establishes each constraint’s position in its respective range, and then all constraints are ranked according to these selected positions. If two constraints have nonoverlapping ranges, it amounts to a fixed ranking between them. With overlapping ranges, the respective ranking of two constraints will vary in a way that reflects how often different candidates surface as the winner for a given UR. For instance, if in the provided example final devoicing is optional, the two candidates [bad] and [bat] could surface in proportion to how often the positions of *D# and Ident(voice) are reversed in the ranking order. While the grammar is still categorical, gradient well-formedness can be captured in terms of the percentage of some number of trials in which a particular form wins.Footnote 9
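The sampling procedure can be sketched in Python. The ranking values, the Gaussian form of the noise, its standard deviation, and the function names are all assumptions chosen for illustration; the values are set so that *D# and Ident(voice) overlap completely while DEP and MAX stay far above both.

```python
import random
from collections import Counter

VIOLATIONS = {
    "bad":  {"*D#": 1},
    "bat":  {"Ident(voice)": 1},
    "ba":   {"MAX": 1},
    "bada": {"DEP": 1},
}
# Hypothetical centers of each constraint's range
RANKING_VALUES = {"DEP": 120.0, "MAX": 120.0, "*D#": 100.0, "Ident(voice)": 100.0}

def sample_ranking(values, noise_sd, rng):
    """Perturb each ranking value with noise, then sort from highest to lowest."""
    noisy = {c: v + rng.gauss(0.0, noise_sd) for c, v in values.items()}
    return sorted(noisy, key=noisy.get, reverse=True)

def optimal(ranking, violations):
    """Categorical OT evaluation under one sampled strict ranking."""
    return min(violations,
               key=lambda cand: tuple(violations[cand].get(c, 0) for c in ranking))

def output_frequencies(values, violations, trials=2000, noise_sd=2.0, seed=0):
    rng = random.Random(seed)
    wins = Counter(optimal(sample_ranking(values, noise_sd, rng), violations)
                   for _ in range(trials))
    return {cand: n / trials for cand, n in wins.items()}

freqs = output_frequencies(RANKING_VALUES, VIOLATIONS)
print(freqs)
```

Each individual evaluation is categorical; the gradience arises only from aggregating over trials, and moving the two lower ranking values apart shifts the proportions toward one variant.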
Another approach to handling optionality in OT is the use of partial orders of constraints. In the example grammar in Table 1, the constraints are actually in a partial, not strict, order because the relative ranking of DEP, MAX, and *D# is irrelevant to selecting the winner. This partial order, shown in (14), can be cashed out into multiple strict orders, some of which are listed in (15).
(14)
a.
b.
c.
In the case of optionality, multiple output forms are possible as long as they win under at least one of the strict orders allowed by the grammar’s partial order. Anttila (1997a, 1997b) demonstrates this potential with the complex patterns of Finnish noun inflections. For example, the genitive form of /maailma/, “world” varies between the “strong” and “weak” forms in (16a) and (16b) (acute and grave accents indicate primary and secondary stress, respectively).
(16)
a.
b.
In Anttila’s analysis, NoClash allows for alternating syllables to be stressed, but secondary stress is optional. The two outputs in (16) tie on the weight-to-stress constraints *Ĺ and *H (i.e., neither violates the former and both violate the latter once). This indeterminacy of the grammar explains why both forms are permitted, as well as their observed frequency of occurrence (~50/50). Optionality is also predicted when two candidates do not tie but disagree on constraints that are not strictly ordered. This is the case for the stem /naapuri/, “neighbor,” which has the possible output forms shown in (17a) and (17b).
(17)
a.
b. [náa.pu.ri.en]
Candidate (17a) violates *H/I (weight–sonority harmony) and *Í (stress–sonority harmony), while candidate (17b) violates *L.L (no lapse). But both are able to surface because this set of constraints {*H/I, *Í, *L.L} is not in a strict order. Which one actually surfaces is determined by random selection of one of the possible strict orders of these three constraints. Furthermore, the probability of each form can be determined as the proportion of possible orders it wins under. In this example, (17a) wins whenever the two constraints it violates are ranked below the one constraint violated by (17b). The ratio of four orders under which (17b) wins to two orders under which (17a) wins corresponds closely to their observed frequencies in a corpus analysis.
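The predicted probabilities follow from enumerating the strict orders consistent with the partial order. The sketch below considers only the three unranked constraints named above and uses the labels "17a" and "17b" as stand-ins for the two candidates; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

# Violations on the three mutually unranked constraints, as described in the text
VIOLATIONS = {
    "17a": {"*H/I": 1, "*Í": 1},
    "17b": {"*L.L": 1},
}
UNRANKED = ["*H/I", "*Í", "*L.L"]

def optimal(ranking):
    """Winner under one strict order of the unranked constraints."""
    return min(VIOLATIONS,
               key=lambda cand: tuple(VIOLATIONS[cand].get(c, 0) for c in ranking))

def win_probabilities():
    """Share of strict orders under which each candidate wins."""
    orders = list(permutations(UNRANKED))
    wins = {cand: 0 for cand in VIOLATIONS}
    for order in orders:
        wins[optimal(order)] += 1
    return {cand: Fraction(n, len(orders)) for cand, n in wins.items()}

print(win_probabilities())  # {'17a': Fraction(1, 3), '17b': Fraction(2, 3)}
```

Of the six strict orders, candidate (17a) wins only in the two where *L.L is ranked highest, which yields the 4:2 ratio in favor of (17b).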
Lastly, Coetzee’s (2006) proposal attributes variation to the way constraints are evaluated instead of the way they are ranked. In this rank-ordering model of EVAL, relative grammaticality can be assessed even among the nonoptimal candidates that do not win. Consider again Table 1, and assume (for the sake of demonstration) that MAX, DEP, and *D# are strictly ordered as shown. Putting aside the winning candidate [bat], the remaining candidates can be ordered according to the ranking of the constraint they fatally violate: [bad] is more well-formed than [ba], which is in turn more well-formed than [bada]. The consequent prediction is that the higher a candidate appears in this order, the more frequent it will be. Limits on variation are imposed with a “cutoff” point in the constraint ranking, such that variation is only possible among candidates whose well-formedness is determined by constraints below the cutoff.
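The rank-ordering of nonoptimal candidates can be illustrated with a short Python sketch. The strict ranking DEP ≫ MAX ≫ *D# ≫ Ident(voice) and the one-violation profiles below are assumptions chosen to match the ordering just described; they are not a reproduction of Table 1, and the cutoff mechanism is not modeled.

```python
# Assumed strict ranking (highest first), inferred from the ordering in the text.
RANKING = ["DEP", "MAX", "*D#", "Ident(voice)"]

# Assumed violation profiles for candidates of /bad/ (one violation each).
CANDIDATES = {
    "[bat]": {"Ident(voice)": 1},
    "[bad]": {"*D#": 1},
    "[ba]": {"MAX": 1},
    "[bada]": {"DEP": 1},
}

def violation_vector(profile, ranking):
    """List violations from the highest- to the lowest-ranked constraint."""
    return tuple(profile.get(c, 0) for c in ranking)

def rank_order(candidates, ranking):
    """Sort candidates from most to least well-formed.

    Comparing violation vectors lexicographically mirrors strict
    domination: the highest-ranked constraint that distinguishes two
    candidates decides between them.
    """
    return sorted(candidates,
                  key=lambda c: violation_vector(candidates[c], ranking))

order = rank_order(CANDIDATES, RANKING)
print(order)
```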
Another prominent approach to variation is the use of a constraint set that is weighted instead of ranked, as in OT’s predecessor harmonic grammar (HG) (Legendre et al. 1990; Smolensky and Legendre 2006). With weighted constraints, candidates are assessed using the weighted sum of constraint violations, called the harmony score:
(18) $H = \sum_{k} w_k s_k$
In (18), $w_k$ is the weight of constraint $k$, and $s_k$ is the number of violations of constraint $k$ (typically represented with negative numbers). The optimal candidate is the one with greatest harmony, or the score closest to zero. Table 2 presents an HG version of the grammar from Table 1 with MAX and DEP omitted for simplicity. The constraint weights are listed at the top of each column and the candidates’ harmony scores are listed at the end of each row.
weights: | 2 | 1 | |
---|---|---|---|
UR: /bad/ | *D# | Ident(voice) | Harmony |
[bad] | −1 | | −2 |
☞ [bat] | | −1 | −1 |
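The harmony scores in Table 2 can be recomputed with a few lines of Python. This is only a sketch: the dictionary encoding and the function name `harmony` are illustrative choices, with violations stored as negative numbers as in the text.

```python
# Constraint weights from Table 2.
WEIGHTS = {"*D#": 2.0, "Ident(voice)": 1.0}

# Violations for each candidate of /bad/, stored as negative counts.
CANDIDATES = {
    "[bad]": {"*D#": -1},
    "[bat]": {"Ident(voice)": -1},
}

def harmony(violations, weights):
    """Weighted sum of (negative) violation counts."""
    return sum(weights[c] * v for c, v in violations.items())

scores = {cand: harmony(v, WEIGHTS) for cand, v in CANDIDATES.items()}
optimal = max(scores, key=scores.get)  # greatest harmony = closest to zero

print(scores, optimal)
```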
One advantage of HG is its ability to model the cumulative effects of violating multiple constraints, in contrast to OT in which only violations of the highest-ranked decisive constraint matter. (More will be said about cumulativity in Section 3.3.) But the grammar is still categorical and outputs a single optimal form. To address variation, Noisy HG (Boersma and Pater 2016) adds random noise to the constraint weights during evaluation:
(19) $H = \sum_{k} (w_k + N_k)\, s_k$
In (19), each $N_k$ is a random variable sampled from a Gaussian distribution. The use of noise to adjust constraint weights allows for potentially different outputs to emerge as optimal, but each time the grammar is used there is still only one output. In contrast, a maximum entropy (MaxEnt) HG grammar produces multiple outputs in the form of a probability distribution over the candidate set. The conditional probability of each candidate y given a UR x is calculated by raising the base of the natural logarithm to the candidate’s harmony score (H(y)) and normalizing over all candidates under consideration, Y (Goldwater and Johnson 2003).
(20) $P(y \mid x) = \dfrac{e^{H(y)}}{Z}$
(21) $Z = \sum_{y' \in Y} e^{H(y')}$
The constraint weights are identified with maximum likelihood estimation: the goal is to find the weights that maximize the product of the conditional probabilities of all input–output pairs in the training corpus. This learning objective is a particular conception of the phonological learning problem, one that is tied to the initial assumption about what form the phonological grammar takes. The next section will explore this connection between grammar and learning problems further by discussing the phonological learning literature grounded in the previously described constraint-based theories.
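The difference between Noisy HG and MaxEnt can be made concrete with a small Python sketch over the two candidates of Table 2. The per-constraint Gaussian noise, the standard deviation of 1.0, the random seed, and all function names are assumptions made for illustration; the log-likelihood function simply restates the maximum likelihood objective as a sum of log probabilities.

```python
import math
import random

WEIGHTS = {"*D#": 2.0, "Ident(voice)": 1.0}
CANDIDATES = {"[bad]": {"*D#": -1}, "[bat]": {"Ident(voice)": -1}}

def harmony(violations, weights):
    """Weighted sum of (negative) violation counts."""
    return sum(weights[c] * v for c, v in violations.items())

def noisy_hg_winner(candidates, weights, sd, rng):
    """One Noisy HG evaluation: perturb each weight, return the single winner."""
    noisy = {c: w + rng.gauss(0.0, sd) for c, w in weights.items()}
    return max(candidates, key=lambda cand: harmony(candidates[cand], noisy))

def maxent_probs(candidates, weights):
    """MaxEnt distribution: exponentiate harmony, then normalize."""
    exps = {cand: math.exp(harmony(v, weights)) for cand, v in candidates.items()}
    z = sum(exps.values())
    return {cand: e / z for cand, e in exps.items()}

def log_likelihood(observed, candidates, weights):
    """Log of the product of the probabilities of the observed outputs."""
    probs = maxent_probs(candidates, weights)
    return sum(math.log(probs[o]) for o in observed)

probs = maxent_probs(CANDIDATES, WEIGHTS)
rng = random.Random(0)  # fixed seed so the simulation is repeatable
samples = [noisy_hg_winner(CANDIDATES, WEIGHTS, sd=1.0, rng=rng)
           for _ in range(1000)]

print(probs)
print(samples.count("[bat]") / len(samples))
print(log_likelihood(["[bat]", "[bat]", "[bad]"], CANDIDATES, WEIGHTS))
```

Each call to `noisy_hg_winner` returns exactly one output, whereas `maxent_probs` returns the whole distribution at once; a learner would search for the weights that make `log_likelihood` as large as possible on the training data.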
3.2 Learning with Constraint-Based Grammars
As with rule-based grammars, a central question for constraint-based grammars is how they are learned from positive data. This section will survey the various ways this question has been addressed, which include different formulations of the learning problem itself. Section 3.2.1 starts with work that assumes the constraint set is known in advance – either as a simplifying assumption or because it is provided by UG – and therefore the learning problem is a matter of identifying the correct ranking of these provided constraints. Section 3.2.2 addresses the problems inherent to learning from surface forms alone, which include the learning of hidden structure and underlying forms. Lastly, Section 3.2.3 turns to the problem of learning the constraints themselves.
3.2.1 Learning Constraint Rankings
Under the assumption that the phonological grammar is a set of ranked constraints and the constraints themselves are provided by UG, the learning problem is to identify the correct ranking of those constraints for the target language. One advantage of defining the learning problem in this way is that the logic of optimization provides implicit negative evidence in the form of the candidates that do not win. Influential work by Tesar (1995) and Tesar and Smolensky (1993, 1996, 1998, 2000) demonstrated how this evidence can be used with an algorithm called recursive constraint demotion (RCD).
Recursive constraint demotion makes use of winner–loser pairs of candidates to produce a stratified hierarchy of groups of constraints, in which constraints in the same group do not conflict with one another.Footnote 10 Returning to the final devoicing example, Table 3 is a comparative tableau (Prince 2000) for pairs of candidates (the first being the desired winner) with indicators of which constraints prefer the winning candidate (W) or the losing candidate (L). Constraint preference here refers to which candidate violates the constraint to a lesser degree; blank cells indicate that the candidates tie on that constraint.
UR: /bad/ | Ident(voice) | *D# | MAX | DEP |
---|---|---|---|---|
bat ~ bad | L | W | | |
bat ~ ba | | | W | |
bat ~ bada | | | | W |
The basic logic of optimization is that the constraints that favor losing candidates must be outranked by at least one constraint that favors the winner. To achieve this, RCD first identifies those constraints that prefer only winners and situates them in the top stratum of the hierarchy. In this example, those constraints are *D#, MAX, and DEP. Winner–loser pairs that are accounted for with this ranking can then be removed, and the process repeats with the remaining pairs and constraints. With this simple example, all winner–loser pairs are accounted for after the first pass, leaving the ranking of {*D#, MAX, DEP} ≫ Ident(voice) as desired.
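The procedure just described can be sketched in Python as follows. The encoding of Table 3 as dictionaries of W/L marks and the function name `rcd` are illustrative assumptions; the sketch places in each stratum every remaining constraint that prefers no loser, which is a common way of stating the RCD step.

```python
CONSTRAINTS = ["Ident(voice)", "*D#", "MAX", "DEP"]

# Winner-loser pairs from Table 3: constraint -> "W" or "L" (ties omitted).
PAIRS = {
    "bat ~ bad": {"Ident(voice)": "L", "*D#": "W"},
    "bat ~ ba": {"MAX": "W"},
    "bat ~ bada": {"DEP": "W"},
}

def rcd(constraints, pairs):
    """Return strata (highest first); raise if the pairs are inconsistent."""
    remaining_constraints = list(constraints)
    remaining_pairs = dict(pairs)
    strata = []
    while remaining_constraints:
        # Constraints that prefer no loser among the still-unexplained pairs.
        stratum = [c for c in remaining_constraints
                   if all(marks.get(c) != "L"
                          for marks in remaining_pairs.values())]
        if not stratum:
            raise ValueError("inconsistent winner-loser pairs")
        strata.append(stratum)
        remaining_constraints = [c for c in remaining_constraints
                                 if c not in stratum]
        # A pair is accounted for once a newly ranked constraint prefers its winner.
        remaining_pairs = {name: marks
                           for name, marks in remaining_pairs.items()
                           if not any(marks.get(c) == "W" for c in stratum)}
    return strata

strata = rcd(CONSTRAINTS, PAIRS)
print(strata)
```

With the three pairs of Table 3, a single pass accounts for every pair, and Ident(voice) is placed in the second stratum.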
With respect to the subset problem discussed in Section 2, identifying the subset relations among grammars of ranked constraints becomes increasingly infeasible as the size of the assumed constraint set grows (i.e., there are k! possible grammars for a set of k constraints). Instead, the search for the most restrictive grammar consistent with the observed data has been addressed through the relative ranking of markedness constraints with respect to faithfulness constraints. Consider the (simplified) example of learning stress patterns as in the PAKA system defined by Tesar et al. (2003). Following richness of the base (Prince and Smolensky 1993), underlying forms can be either stressed or unstressed, and the constraints in question include the markedness constraint StressLeftmost (first syllable is stressed) and the faithfulness constraint Ident(stress) (preserve underlying values for stress). The ranking of markedness over faithfulness (StressLeftmost ≫ Ident(stress)) is the most restrictive: all underlying contrasts for the feature stress collapse to the predictable pattern of first syllable stress. The ranking of faithfulness over markedness (Ident(stress) ≫ StressLeftmost) is the least restrictive in that all underlying contrasts are preserved.
More generally, a grammar’s degree of restrictiveness can be assessed in terms of how many markedness constraints dominate faithfulness constraints. Prince and Tesar (2004) call this the r-measure, with a larger r-measure corresponding to a more restrictive grammar. But now consider that in the case of phonotactic learning – in which the UR is assumed to be identical to the surface representation, or SR (more on this in the next section) – faithfulness constraints are effectively inviolable. Since RCD ranks constraints as high as possible, faithfulness constraints will end up at the top of the hierarchy at great cost to the r-measure. Prince and Tesar’s (2004) proposed solution is biased constraint demotion (BCD), in which faithfulness constraints are only situated into the hierarchy when no markedness constraints are available.Footnote 11 The consequence is a delay in ranking faithfulness constraints that maximizes the resulting grammar’s r-measure, as desired.
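As a rough illustration, the r-measure of a stratified hierarchy can be computed as below. The sketch assumes that the measure sums, for each faithfulness constraint, the number of markedness constraints in higher strata; the function name and the list-of-strata encoding are illustrative.

```python
def r_measure(strata, faithfulness):
    """Sum, over faithfulness constraints, of the markedness constraints above them.

    `strata` is a list of lists of constraint names, highest stratum first;
    `faithfulness` is the set of names that count as faithfulness constraints.
    """
    total = 0
    markedness_above = 0
    for stratum in strata:
        n_faith = sum(1 for c in stratum if c in faithfulness)
        total += n_faith * markedness_above
        markedness_above += len(stratum) - n_faith
    return total

FAITH = {"Ident(stress)"}

restrictive = [["StressLeftmost"], ["Ident(stress)"]]
permissive = [["Ident(stress)"], ["StressLeftmost"]]

print(r_measure(restrictive, FAITH), r_measure(permissive, FAITH))
```

On the two PAKA rankings discussed above, the markedness-over-faithfulness ranking receives the larger value, in line with its being the more restrictive grammar.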
In the case of stochastic OT, the learning problem is not to learn a fixed ranking, but the range of values associated with each constraint. The gradual learning algorithm (GLA) proposed by Boersma (1997) and Boersma and Hayes (2001) assumes these ranges are Gaussian distributions (with a fixed standard deviation) that are centered on a constraint-specific ranking value, in which case the target of learning is to identify these ranking values. The learner is error-driven and makes use of (UR, SR) pairs for which the current hypothesis of the grammar selects an incorrect winner as the SR. A constraint that is violated by the actual SR but not the incorrect winner is moved down the scale, while a constraint violated by the incorrect winner but not the actual SR is moved up the scale. These movements are by a fixed amount, called the plasticity.
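A single error-driven update of this kind might look as follows in Python. The starting ranking values of 100, the plasticity of 1.0, the evaluation noise of 2.0, and the function names are all arbitrary choices for illustration; comparing violation counts (rather than checking for presence versus absence of a violation) is a slight generalization of the description above.

```python
import random

def gla_update(ranking_values, observed_viols, error_viols, plasticity):
    """Return new ranking values after one error-driven update."""
    updated = dict(ranking_values)
    for constraint in updated:
        obs = observed_viols.get(constraint, 0)
        err = error_viols.get(constraint, 0)
        if obs > err:
            # Violated more by the observed SR: move down the scale.
            updated[constraint] -= plasticity
        elif err > obs:
            # Violated more by the learner's incorrect winner: move up.
            updated[constraint] += plasticity
    return updated

def sample_ranking(ranking_values, sd, rng):
    """Draw one strict ranking by adding Gaussian noise to each ranking value."""
    noisy = {c: v + rng.gauss(0.0, sd) for c, v in ranking_values.items()}
    return sorted(noisy, key=noisy.get, reverse=True)

# The learner wrongly produced [bad] for /bad/; the observed SR is [bat].
values = {"Ident(voice)": 100.0, "*D#": 100.0}
values = gla_update(values,
                    observed_viols={"Ident(voice)": 1},
                    error_viols={"*D#": 1},
                    plasticity=1.0)
ranking = sample_ranking(values, sd=2.0, rng=random.Random(0))

print(values, ranking)
```

Repeating such updates over many tokens gradually separates the two ranking values, so that *D# comes to outrank Ident(voice) in most sampled rankings.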
Stochastic OT’s ability to handle variation means it can learn from noisy data that not only reflects optionality but also includes speech errors. When the same UR appears with multiple SRs in the training data, each token will have an effect on the ranking values of the relevant constraints. The resulting grammar will then generate those forms in proportion to their frequency of occurrence in the data. In addition, as demonstrated by Zuraw (2000), stochastic ranking can account for lexical regularities that do not drive alternations and often have exceptions. With the GLA’s method of adjusting ranking values, the more words that violate a constraint, the lower ranked it will be (and vice versa). The likelihood of a word being an exception is then captured by the degree of overlap among the relevant constraints.
To close this section, we will briefly discuss learning rankings in harmonic serialism (HS) (McCarthy 2000), a constraint-based framework that reintroduces the concept of derivation. In HS, the UR–SR mapping occurs in steps, with each step consisting of an OT-style selection of the optimal candidate according to a fixed ranking of constraints. The winning candidate becomes the input to the subsequent step, and GEN is restricted such that each candidate can differ from the input by only a single violation of a faithfulness constraint. The derivation concludes when the fully faithful candidate is selected as the winner.
As discussed by Tessier and Jesney (2014), HS’s use of derivation introduces a challenge for error-driven learning in that the informative error may be hidden in one of the intermediate steps. To address this challenge, Tessier (2012) proposes a multistage learning process in which the ranking information that can be gleaned from the SRs is later refined using the candidate set generated for observed forms, first to construct winner–loser pairs and then as hypothetical inputs to the grammar. Jarosz (2016) also addresses the problems inherent to learning derivations in the context of serial markedness reduction (SMR) (Jarosz 2014), a variant of HS proposed to handle opaque process interactions.
In SMR, candidates are annotated with which markedness constraints they satisfy, and additional serial markedness constraints are used to favor candidates that satisfy constraints in a particular order. To learn such grammars, Jarosz (2016) proposes expectation driven learning (EDL), a probabilistic learning approach that in this case assumes a stochastic version of Anttila’s (1997a, 1997b) partial order grammars discussed previously in Section 3.1. The learner identifies the pairwise ranking probabilities of the constraints based on how often each ranking successfully generates the observed data. Because the learner considers each possible (pairwise, not total) ranking in turn, it does not need information about the intermediate steps of the derivation and uses only examples of the composite UR–SR mapping.
The intermediate steps of an HS derivation are one example of the hidden structure problem that has drawn a lot of attention in the phonological learning literature. Expectation driven learning and other probabilistic learning methods for constraint-based grammars have played a large role in this line of work, which will be explored further in the next section.
3.2.2 Learning Hidden Structure
The term hidden structure refers to information not available in the observable data that is nonetheless important for identifying the grammar that generated those forms. While providing a learner with full structural descriptions and/or (UR, SR) pairs can be a valuable simplifying assumption for making initial progress on phonological learning problems, the more realistic setup of learning from overt forms is the ultimate goal. This section will review some of the work that has drawn on constraint-based frameworks to take on that challenge.
One type of hidden structure is the ambiguity of syllable boundaries and metrical structure. Consider an overt form [apa], which could be syllabified in various ways, including [a.pa] and [ap.a]. Similarly, stress placement on the first vowel could result from a trochaic foot (ápa), or a degenerate foot followed by an extrametrical final syllable (á)<pa>. The correct parse depends on the grammar, but to learn that grammar the learner needs to know what the correct parse is. To address this circularity, Tesar and Smolensky (2000) incorporate robust interpretative parsing (RIP) into an iterative version of RCD. Starting from an assumed initial constraint hierarchy, RIP maps an overt form to its full structural description according to this grammar. The UR for that structural description is then mapped to its optimal SR, again according to the current grammar. If these two structural descriptions do not match, they are used as a winner–loser pair by RCD to revise the grammar. This parsing/production feedback loop iterates until there are no more mismatches.Footnote 12
As a simple example, consider a target grammar that assigns penultimate stress by way of a right-aligned trochee. Now assume some (incorrect) constraint hierarchy that parses the overt form [σσσσ́σ] as [σσ(σσ́)σ]. In production, that same grammar then maps the UR for this form, /σσσσσ/, to [(σσ́)σσσ]. Since these two structural descriptions do not match – and further, the grammar’s placement of stress contradicts what is actually observed – the learner knows that the hypothesized constraint hierarchy is wrong and needs to be adjusted. As with RCD more generally, the RIP/CD algorithm capitalizes on the implicit negative evidence of spurious winning forms, generating its own informative errors by using the same grammar for production and parsing.
Another source of the hidden structure problem is the lexicon of URs. The ranking of constraints that generates a UR–SR mapping depends on what the UR is, but the learner again only has access to overt forms. This interdependence of the grammar and the lexicon has been addressed in various ways. In the surgery learning algorithm (Tesar et al. 2003), when BCD runs into inconsistency in the set of winner–loser pairs – for example, if two constraints are left that have opposite winner–loser preferences – the learner uses that dead end as a cue that the lexicon must be modified. Lexicon updates target each (alternating) morpheme in turn until the inconsistency is resolved, after which the winner–loser pairs containing that morpheme are adjusted to reflect the change.
The interdependence problem has also been addressed by drawing on the learner’s prior knowledge of phonotactics. Tesar and Prince (2007) explore this idea with an algorithm that first establishes a preliminary constraint ranking in a stage of phonotactic learning in which all URs are assumed to be identical to the SRs. Because the grammar is known to accept the SR, we can assume that if that SR were a UR the grammar would map it faithfully (since unfaithful mappings only occur when underlying structures cannot surface).
As an example, consider a target language in which codas are devoiced. A dataset of SRs for this language is given in (22). At this stage, the learner has no knowledge of morphological structure (i.e., all SRs are taken to be distinct and monomorphemic).
(22) {tate, date, tade, dade, tat, dat}
Starting with the hierarchy in (23) in which all markedness outranks faithfulness, the learner uses BCD to find the most restrictive ranking that maps all SRs to themselves.Footnote 13 The form [dat], for example, presents a problem because its violation of *Voice makes it less harmonic than [tat] according to the initial ranking. Yet both are grammatical. Demoting *Voice below Ident(voice) solves this problem.
(23) {*SyllableFinalVoice, *Voice, *IntervocalicVoiceless} ≫ Ident(voice)
And so on, until the learner arrives at the ranking in (24).
(24) *SyllableFinalVoice ≫ Ident(voice) ≫ {*Voice, *IntervocalicVoiceless}
This preliminary ranking is then refined by bringing in knowledge of the morphological structure of the SRs and therefore witnessing alternations. The data at this point incorporates information about morpheme boundaries and identity:
(25)
When a feature is observed to alternate, the learner considers all possible candidates for the UR. For example, the UR of morpheme 1 is either /tad/ or /tat/. With these hypotheses, the learner can test its current grammar (24) with the possible mappings in (26).
(26)
a. Hypothesis A: /tad/ → [tat], /tad-e/ → [tade]
b. Hypothesis B: /tat/ → [tat], /tat-e/ → [tade]
As shown in Tables 4 and 5, Hypothesis A succeeds under the current constraint ranking, but Hypothesis B fails. The learner can thus conclude that Hypothesis A is correct and the UR is /tad/.
UR: /tad/ | *SyllableFinalVoice | Ident(voice) | *Voice | *IntervocalicVoiceless |
---|---|---|---|---|
[tad] | *! | | * | |
☞ [tat] | | * | | |
UR: /tad-e/ | | | | |
☞ [tade] | | | * | |
[tate] | | *! | | * |
UR: /tat/ | *SyllableFinalVoice | Ident(voice) | *Voice | *IntervocalicVoiceless |
---|---|---|---|---|
[tad] | *! | * | * | |
☞ [tat] | | | | |
UR: /tat-e/ | | | | |
[tade] | | *! | * | |
☞ *[tate] | | | | * |
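The test of the two UR hypotheses can be replayed with a small Python sketch. The violation profiles below are derived from the constraint definitions as used in Tables 4 and 5, and the relative order of *Voice and *IntervocalicVoiceless is fixed arbitrarily since it plays no role here; the data structures and function names are illustrative assumptions.

```python
# Current ranking, highest first (the last two constraints are not crucially ordered).
RANKING = ["*SyllableFinalVoice", "Ident(voice)", "*Voice", "*IntervocalicVoiceless"]

# UR -> candidate -> violation counts (profiles assumed from the constraint definitions).
TABLEAUX = {
    "/tad/": {
        "[tad]": {"*SyllableFinalVoice": 1, "*Voice": 1},
        "[tat]": {"Ident(voice)": 1},
    },
    "/tad-e/": {
        "[tade]": {"*Voice": 1},
        "[tate]": {"Ident(voice)": 1, "*IntervocalicVoiceless": 1},
    },
    "/tat/": {
        "[tad]": {"*SyllableFinalVoice": 1, "Ident(voice)": 1, "*Voice": 1},
        "[tat]": {},
    },
    "/tat-e/": {
        "[tade]": {"Ident(voice)": 1, "*Voice": 1},
        "[tate]": {"*IntervocalicVoiceless": 1},
    },
}

# Each hypothesis pairs its URs with the observed SRs of morpheme 1.
HYPOTHESES = {
    "A": {"/tad/": "[tat]", "/tad-e/": "[tade]"},
    "B": {"/tat/": "[tat]", "/tat-e/": "[tade]"},
}

def optimum(candidates, ranking):
    """Pick the winner by comparing violation vectors lexicographically."""
    return min(candidates,
               key=lambda c: tuple(candidates[c].get(k, 0) for k in ranking))

def consistent(hypothesis, tableaux, ranking):
    """True if the grammar maps every hypothesized UR to its observed SR."""
    return all(optimum(tableaux[ur], ranking) == sr
               for ur, sr in hypothesis.items())

results = {name: consistent(h, TABLEAUX, RANKING)
           for name, h in HYPOTHESES.items()}
print(results)
```

Hypothesis B fails because the current ranking maps /tat-e/ to [tate] rather than to the observed [tade].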
Stepping back, the broader intuition here is that the hypothesis that /tat/ is the UR can only succeed if this language has intervocalic voicing, a possibility ruled out by the phonotactics (i.e., the existence of [tate]). But the devoicing required by the UR /tad/ is consistent, since no SRs end in voiced obstruents.
In this simple example, the initial constraint ranking gleaned from the phonotactics did not have to be revised, but more realistic cases will involve multiple faithfulness constraints that cannot be ranked with respect to each other based on phonotactics alone. As a result, the initial ranking may fail to generate all of the mappings for the hypothesized URs. In such a case, the mismatches between the observed winners and the optima can again be used by BCD to revise the initial ranking, with the revision that succeeds being the indicator of which hypothesized UR is correct. The interdependence problem thus points to its own solution, as inconsistencies in grammar–lexicon combinations provide the cues to revise both in an error-driven feedback loop.
The utility of assuming that if a form x is grammatical then so must be the mapping x → x is explored further in Tesar’s (2014, 2017) subsequent work on output-driven maps.Footnote 14 The designation of a map as output-driven refers to the following entailment relation: “for every grammatical candidate A→X of the map, if candidate B→X has greater similarity than A→X, then B→X is also grammatical (it is part of the map)” (Tesar 2017: 150). Similarity here refers to the number of disparities (i.e., feature changes) between inputs and outputs. For example, páká → paká: has two disparities (one stress, one length) and paká → paká: has one (length only). If the map is output-driven, then the inclusion of páká → paká: implies the inclusion of the more similar paká → paká:.
Tesar’s thesis is that the property of being output-driven imposes structure on the learner’s hypothesis space that can be exploited during its search for the correct grammar and lexicon. Importantly, output-drivenness is a property of the map itself, not of the OT grammar that generates it (see Section 7 for more on this idea). This is what is meant by structuring the hypothesis space. The set of maps that can be generated by OT grammars is large – on account of its combinatorics (k! possible rankings of k constraints, though more than one ranking may generate the same map) as well as its formal generative capacity (more on this in Section 3.3.1). The learner’s assumption that its target grammar can only generate a subset of those maps eliminates a great many hypotheses.
As in Tesar and Prince (2007), the output-driven learner (ODL) undergoes a stage of phonotactic learning without any morphological awareness before receiving information about alternations in order to identify URs. At this stage, the entailment relation inherent to the output-driven property serves to eliminate entire sets of possible URs all at once. To see how, consider an observed SR like [paká:]. The learner constructs a hypothesized UR with only one disparity relative to that SR, such as /paká/. If BCD then concludes that the mapping of /paká/ → [paká:] is inconsistent, the learner can reject the hypothesis of /paká/ as well as any other UR hypotheses that are less similar to the SR than /paká/ is (e.g., /pa:ká/, /paka/, /páka/).
From there, the learner can conclude that any feature value shared by all remaining hypothesized URs must be present in the correct UR. Once an underlying feature value is set in this way, SRs in which that feature surfaces unfaithfully can provide further information about the correct ranking. For example, if [páka] is an SR and the [ka] morpheme has already been identified as having the UR /ka:/, then the constructed mapping /páka:/ → [páka] must be grammatical, because any other possibility will involve more disparities. This mapping can then be used to construct winner–loser pairs and adjust the constraint ranking if needed.
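The elimination step and the subsequent feature setting can be sketched in Python. The sketch makes two assumptions that go beyond the text: syllables are encoded only by the binary features stress and length, and a UR counts as "less similar" than the failed hypothesis when its disparities include all of that hypothesis's disparities. All names are illustrative.

```python
from itertools import product

FEATURES = ("stress", "long")

def all_urs(n_syllables):
    """Every combination of feature values over n syllables."""
    values = list(product([False, True], repeat=len(FEATURES)))
    return [tuple(dict(zip(FEATURES, v)) for v in combo)
            for combo in product(values, repeat=n_syllables)]

def disparities(ur, sr):
    """Set of (syllable index, feature) pairs on which UR and SR differ."""
    return {(i, f) for i, (u, s) in enumerate(zip(ur, sr))
            for f in FEATURES if u[f] != s[f]}

# SR [paká:]: unstressed short syllable, then stressed long syllable.
SR = ({"stress": False, "long": False}, {"stress": True, "long": True})
# Rejected hypothesis /paká/: differs from the SR in final-syllable length only.
FAILED_UR = ({"stress": False, "long": False}, {"stress": True, "long": False})

failed = disparities(FAILED_UR, SR)

# Discard the failed UR and every UR whose disparities include the failed ones.
remaining = [ur for ur in all_urs(len(SR))
             if not failed <= disparities(ur, SR)]

# Any feature value shared by all remaining URs can be set in the lexicon.
set_features = {}
for i in range(len(SR)):
    for f in FEATURES:
        vals = {ur[i][f] for ur in remaining}
        if len(vals) == 1:
            set_features[(i, f)] = vals.pop()

print(len(remaining), set_features)
```

Under these assumptions, a single inconsistent mapping removes half of the sixteen candidate URs and sets the length of the final syllable to long.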
The use of inconsistency detection to identify environments in which a feature is contrastive has precedent in the contrast pair and ranking (CPR) information algorithm of Merchant (2008) and Merchant and Tesar (2008). Contrast pair and ranking has the advantage of being able to set multiple features at once by constructing local lexica for all possible settings of unset features, but the ODL is ultimately more efficient given that the number of lexica can grow quite large depending on how many features alternate.Footnote 15
Looking beyond categorical grammars, the promise of probabilistic models for handling gradient phonotactics (see Section 4) led to their application to a wider range of phonological learning problems, including the simultaneous learning of grammars and lexicons. For example, Jarosz (2006a, 2006b) characterizes this problem in the framework of maximum likelihood learning of lexicons and grammars (MLG), in which each possible constraint ranking is assigned a probability and the conditional probability of a candidate (given a UR) is summed across all rankings for which it is the winner. Another probability distribution across possible URs provides the conditional probability of a UR given a morpheme. The learning problem is then a matter of identifying the parameters for these distributions that maximize the likelihood of the training data (= morphologically analyzed SRs and their frequencies). Enacting richness of the base, the set of possible URs is rich, though not fully unconstrained. It is generated from the SRs based on all possible feature variants that could generate one of the observed SRs of the morpheme in question, as well as all possible insertions and deletions that could be generated under some constraint ranking. The learning algorithm is expectation maximization: starting from uniform distributions, the parameters are iteratively adjusted until convergence (i.e., the change from the previous iteration is below some threshold).
Jarosz (2009) further shows how MLG as a probabilistic learner subsumes the kinds of ranking biases enacted by the OT learners discussed previously in this Element. The resulting grammars are restrictive in the sense that likelihood will be maximized by a grammar that maps as much of the rich base to observed forms as possible (i.e., by not wasting probability mass on unobserved forms). This works out to ranking markedness over faithfulness without an explicitly encoded bias. Working instead with MaxEnt, O’Hara (2017) shows that an explicit mechanism is also not needed for a probabilistic learner to learn abstract URs (i.e., URs with a combination of features that never surface together in a single SR), as these fall out naturally when observed gaps in segment distributions are minimized.
Another approach to UR learning has been the use of constraints on the URs themselves (Zuraw 2000; Boersma 2001). Apoussidou (2007) makes use of these lexical (or UR) constraints in an online error-driven learner. Lexical constraints prohibit a particular meaning–form pairing in the lexicon; for example, Apoussidou proposes the constraint in (27) as part of the grammatical stress system of Modern Greek:Footnote 16
(27) *|θalas-| “sea”: Do not connect the meaning “sea” to |θalas-|
Each candidate UR has its own constraint. Learning then proceeds through a recognition stage and a virtual production stage. Recognition involves an RIP-like process of mapping an SR to its optimum candidate (UR, SR, meaning) triplet. Virtual production checks which triplet the current grammar selects for that particular meaning. If the same candidate is selected in recognition and virtual production, no change is needed. Otherwise, the error signals a constraint reranking (via the GLA).
Lexical constraints prohibiting (or requiring) particular URs in a particular language clearly cannot be part of an innate and universal CON. Though no algorithm is given, Apoussidou (2007: 170) suggests they could instead be induced whenever a new meaning–form combination is encountered. Going further, Nelson (2019) provides a method for inducing lexical constraints that also addresses the related problem of morpheme segmentation. Given an SR and the unordered set of morphemes it contains, lexical constraints are induced based on all possible segmentations of that SR. For example, if the SR [abc] contains two morphemes (M1 and M2), the possible segmentations are [ab-c] and [a-bc], and so the needed lexical constraints include M1=ab (i.e., M1 must be [ab]), M1=c, M1=a, M1=bc, and so on.
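The induction of lexical constraints from segmentations can be illustrated with a brief Python sketch. It is a simplified reading of the description above rather than Nelson's implementation: pieces are contiguous and non-empty, every morpheme may be paired with every piece, and the function names and the `M=form` string notation are illustrative.

```python
from itertools import combinations, permutations

def segmentations(sr, n):
    """All ways to split `sr` into n non-empty contiguous pieces."""
    result = []
    for cuts in combinations(range(1, len(sr)), n - 1):
        bounds = (0, *cuts, len(sr))
        result.append([sr[i:j] for i, j in zip(bounds, bounds[1:])])
    return result

def induce_lexical_constraints(sr, morphemes):
    """Collect one constraint per possible morpheme-to-piece pairing."""
    constraints = set()
    for pieces in segmentations(sr, len(morphemes)):
        # The morpheme set is unordered, so try every assignment to pieces.
        for order in permutations(morphemes):
            constraints.update(f"{m}={p}" for m, p in zip(order, pieces))
    return constraints

constraints = induce_lexical_constraints("abc", ["M1", "M2"])
print(segmentations("abc", 2))
print(sorted(constraints))
```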
As both of these solutions depend on the UR being one of the SRs (i.e., the basic alternant constraint; see Kenstowicz and Kisseberth 1979: 202), abstract URs that contain underspecified segments will present a challenge.Footnote 17 Pater et al. (2012) address this by allowing for different URs to be selected in different contexts (i.e., overspecification).Footnote 18 With the inclusion of the UR constraints, their MaxEnt grammar identifies the most likely (UR, SR) combination for a given meaning in a way that captures broader generalizations in the language while still allowing for non-alternating morphemes (e.g., the three-way voicing contrast in Turkish analyzed by Inkelas et al. 1997 as a case of underspecification).Footnote 19
Lastly, the hidden structure learning problem has also been explored in the context of the intermediate representations of derivational theories such as stratal OT (Bermúdez-Otero 1999, 2003; Kiparsky 2000) and HS (McCarthy 2000). Staubs and Pater (2016) show how the order of operations in an HS derivation can be established through the constraint weights assigned by a MaxEnt learner tasked with maximizing the likelihood of the observed SRs. Following Eisenstat (2009), they take the probability of an SR to be the summed probability of the UR–SR mappings that could have generated it. Extending this to HS, the probability of an SR is the summed probability of the derivations that could have generated it. A derivation’s probability is the joint probability of its steps, with the probability of a step being the SR’s share in the distribution over the candidate set. The initial step makes use of UR constraints to identify the most likely UR for a given meaning. Following assumptions of HS, subsequent steps identify the most likely candidate among a set generated by applying a single operation (e.g., one added stress or segment). Nazarov (2016) and Nazarov and Pater (2017) extend this approach to stratal OT, in which a word-level grammar is followed by a phrase-level one in which the constraint ranking/weighting can potentially differ.
As the scope of the work in this section has made clear, constraint-based grammars have enabled great progress on a variety of learning challenges, including noisy data and hidden structure. Following the theoretical assumption that the constraint set CON is innate and universal, the learners discussed previously in this Element are all in practice provided with the constraints relevant to the patterns in question. The work reviewed in the next section considers the alternative possibility that the constraints themselves are also learned, creating the potential for future work to integrate such a step into methods for grammar and lexicon learning.
3.2.3 Learning Constraints
Ellison (1991, 1992) describes an MDL approach to learning the inviolable constraints on a language’s representations assumed by work in declarative phonology. As discussed in Section 2 in the context of learning rule-based grammars, MDL assesses a hypothesized grammar in terms of the size of the grammar itself as well as the encoding of the data with respect to that grammar, with the goal of minimizing that combined sum. A constraint template is assumed so that the cost of the template can be levied once regardless of how many constraints instantiate it. Constraints are selected iteratively, such that the value of adding each constraint can be assessed in comparison to a version of the grammar that lacks it. The search for constraints is terminated when the grammar can no longer be improved, meaning the cost of adding another constraint is not sufficiently balanced by a reduction in the cost of encoding the data.
Turning now to the learning of violable constraints, the highly influential MaxEnt phonotactic learner of Hayes and Wilson (2008) learns both the constraint weights and the constraints themselves. As with the MDL learner just described, the MaxEnt learner works from a constraint template in the form of a sequence of feature matrices bounded by a specified length. Constraints are selected from this hypothesis space using the heuristics of accuracy and generality. A constraint’s accuracy is defined as an observed/expected (O/E) ratio of violations in the data under the currently hypothesized grammar. Generality means priority is given to constraints that are shorter and that include larger natural classes. The current constraint set is reweighted with each new constraint addition, and the search terminates when no constraints are left that are sufficiently accurate (or when a grammar of a designated size has been found).Footnote 20
As a phonotactic model, MaxEnt interprets well-formedness as a probability distribution over all possible SRs. Since the focus is only on SRs, a candidate’s probability is not conditioned on a particular UR, and faithfulness constraints play no role (i.e., only markedness constraints are learned). An individual SR’s probability represents its share of the total maxent values (= e raised to the negation of the harmony score defined previously in Section 3.1) of all possible SRs. The learning objective is then to find the weights that maximize the probability of the observed forms and minimize the probability of unobserved ones. Modifications and extensions of this approach have been applied to a range of phonological learning problems, including learning the features as well as the constraints (Nazarov Reference Nazarov2016), distinguishing between true constraints and accidental gaps (Wilson and Gallagher Reference Wilson and Gallagher2018), the learning of nonlocal constraints (Gouskova and Gallagher Reference Gouskova and Gallagher2020), the potential for a naturalness bias to rule out “accidentally true” constraints that hold without exception in the data but are not part of the speakers’ grammatical knowledge (Hayes and White Reference Hayes and White2013), and the use of n-gram probabilities as a way of reducing the size of the hypothesis space of constraints (Nelson Reference Nelson, Ettinger, Hunter and Prickett2022).
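As a minimal sketch of how such probabilities are computed, the following Python fragment evaluates a MaxEnt distribution over a small, finite stand-in for the set of possible SRs. The forms, the two constraints, and their weights are hypothetical and chosen only for illustration; the learner of Hayes and Wilson additionally searches for the constraints and fits the weights, neither of which is shown here.

```python
import math

# Hypothetical markedness constraints: each maps a form to a violation count.
def star_bn(form):
    """Penalize each occurrence of the sequence 'bn' (illustrative only)."""
    return form.count("bn")

def star_final_b(form):
    """Penalize a word-final 'b' (illustrative only)."""
    return 1 if form.endswith("b") else 0

CONSTRAINTS = [star_bn, star_final_b]
WEIGHTS = [3.0, 1.0]  # assumed weights, not fitted to any data

def harmony(form):
    """Weighted sum of constraint violations."""
    return sum(w * c(form) for w, c in zip(WEIGHTS, CONSTRAINTS))

def maxent_probs(forms):
    """Each form's share of the total maxent value exp(-harmony)."""
    values = {f: math.exp(-harmony(f)) for f in forms}
    total = sum(values.values())
    return {f: v / total for f, v in values.items()}

# A toy stand-in for "all possible SRs".
FORMS = ["blik", "bnik", "blib", "bnib"]
probs = maxent_probs(FORMS)
```

Because no UR is involved, the normalization runs over the whole set of forms rather than over the candidates for a particular input.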
3.3 Theory Comparison
To wrap up the discussion of constraint-based theories of phonology, this section turns to the use of computational and quantitative methods for theory comparison, including the comparison between rule- and constraint-based theories as well as comparisons among different constraint-based theories. These comparisons have been made using a variety of considerations, including formal complexity, learnability and convergence results, and expressivity with respect to questions of long-standing theoretical interest. In what follows, these will each be discussed in turn.
3.3.1 Complexity
The most common criteria by which rule- and constraint-based grammatical formalisms have been compared to each other include empirical adequacy, explanatory redundancy, and potential for corresponding learning algorithms, but computational complexity has also played a role. Within computational phonology, the well-known result that SPE grammars are regular relations provided the rules do not reapply to their own structural changes (Johnson Reference Johnson1972; Kaplan and Kay Reference Kaplan and Kay1994) leads to the inevitable question of whether OT grammars preserve that property.Footnote 21 In short, the answer is no, but a line of work exploring different ways of implementing OT with finite-state machinery provided further insight into the sources of that increased power.
Gerdemann and Hulden (Reference Gerdemann, Hulden, Alegria and Hulden2012) provide a simple proof that OT is capable of generating non-regular relations, using the example grammar shown in Table 6. In the first tableau, the input /aaabb/ violates *ab and is mapped to the optimum [aaa] in which both /b/’s are deleted. (For simplicity, DEP and Ident are not shown, but these are assumed to be ranked above *ab such that candidates like [aaacc] or [aaacbb] are also ruled out.) In the second tableau, the input /aabbb/ is instead mapped to [bbb], in which the two /a/’s are deleted.
UR: /aaabb/ | *ab | MAX |
---|---|---|
[aaabb] | *! | |
☞[aaa] | | ** |
[bb] | | **!* |

UR: /aabbb/ | *ab | MAX |
---|---|---|
[aabbb] | *! | |
[aa] | | **!* |
☞[bbb] | | ** |
More generally, inputs of the form $a^n b^m$ will always be mapped by this grammar to either $a^n$ or $b^m$ depending on which is larger: $n$ or $m$. When deletion is the preferred repair for violations of *ab, optimization will insist on deleting as few segments as possible. The relation generated by this grammar ($a^n b^m \mapsto a^n$ if $n > m$, and $a^n b^m \mapsto b^m$ if $m > n$) is not finite-state describable.Footnote 22
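The behavior of this grammar can be checked by brute force. The sketch below is an illustration rather than the construction used in the cited proof: it assumes that GEN is limited to deletion (since DEP and Ident are taken to be undominated), enumerates every deletion candidate, and selects the optima under the ranking *ab >> MAX. The function names are hypothetical.

```python
from itertools import combinations

def candidates(ur):
    """All outputs obtainable from ur by deleting zero or more segments."""
    outs = set()
    for k in range(len(ur) + 1):
        for keep in combinations(range(len(ur)), k):
            outs.add("".join(ur[i] for i in keep))
    return outs

def violations(ur, cand):
    """Violation profile ordered by rank: (*ab, MAX)."""
    star_ab = cand.count("ab")       # one mark per 'ab' sequence
    max_marks = len(ur) - len(cand)  # one mark per deleted segment
    return (star_ab, max_marks)

def optima(ur):
    """Candidates with the lexicographically smallest violation profile."""
    cands = candidates(ur)
    best = min(violations(ur, c) for c in cands)
    return sorted(c for c in cands if violations(ur, c) == best)

first = optima("aaabb")   # corresponds to the first tableau
second = optima("aabbb")  # corresponds to the second tableau
tie = optima("aabb")      # n = m: both repairs cost the same
```

In this sketch an input with equal numbers of a's and b's yields two co-optimal outputs, a case the description above leaves open. The enumeration is exponential in the length of the input, so it is only suitable for short strings.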
A finite-state version of OT then must impose certain restrictions to ensure the generated relations stay within the bounds of regular. For example, Ellison’s (Reference Ellison and Nagao1994) implementation assumes (1) constraints are binary and regular (i.e., can be represented with an FST that maps candidates to their lists of marks), and (2) the candidate set produced by GEN is a regular language. Frank and Satta (Reference Frank and Satta1998) show that an upper bound on constraint violations – after which the grammar cannot make distinctions among candidates – suffices to make OT finite-state describable. However, Riggle (Reference Riggle2004) is able to relax some of these restrictions by focusing on the set of contenders, or candidates that are not harmonically bounded, and using a monolithic evaluator instead of a cascade-style combination of individual constraints.Footnote 23 Next, comparing finite-state implementations of a parametrized metrical grid theory and an OT one (using Karttunen’s Reference Karttunen, Karttunen and Oflazer1998 lenient composition operator to combine the constraints), Idsardi (Reference Idsardi, Raimy and Cairns2009) shows that the latter is far less efficient, requiring forty-five states compared to the two states required by the “rule-based” machine. And in more recent work, Lamont (Reference Lamont2021) shows that OT can generate context-sensitive languages with constraints banning subsequences, and Lamont (Reference Lamont2022) shows that even with simple constraints, OT can actually generate non-pushdown functions.
The regular/non-regular divide is particularly relevant for finite-state approaches, but the generation problem in OT has been a subject of broader concern. Eisner (Reference Eisner and Kay2000) proves that OT is NP-hard by reducing the directed Hamiltonian graph problem, which is itself NP-complete, to the generation problem. The proof assumes an OT-variant called Primitive OT (Eisner Reference Eisner, Cohen and Wahlster1997), in which constraints dictate the extent to which constituents can overlap on an autosegmental-like timeline, but Idsardi (Reference Idsardi2006) shows that the same result holds assuming the more standard MAX, DEP, unigram and bigram markedness constraints, and self-conjoined constraints of the sort proposed by Ito and Mester (Reference Ito and Mester2003) to handle co-occurrence restrictions (though see Heinz et al. Reference Heinz, Kobele and Riggle2009 and Kornai Reference Kornai2009 for critical responses). More recently, Hao (Reference Hao2024) shows that the universal generation problem in OT (i.e., generation when CON is not fixed but provided as an input) is PSPACE-complete.
As results such as these make clear, the complexity of OT depends on what, if any, restrictions are assumed for the interacting components of GEN, EVAL, and CON. With respect to CON in particular, one approach to formalizing such restrictions has been the use of a constraint definition language (CDL) that specifies the syntactic primitives and rules of combination for constraints, as well as the means by which they calculate violation marks (see de Lacy Reference de Lacy, van Oostendorp, Ewen, Hume and Rice2011 for more discussion of CDLs). For example, Potts and Pullum (Reference Potts and Pullum2002) use model theory to make the meaning of OT constraints more precise, in particular by characterizing candidates as a class of structures and defining a description logic for the constraints over that class (more will be said about model theory in Section 7.3). They show that a wide range of constraints can be thus described using a limited modal logic, while certain constraint types (e.g., output–output identity and inter-candidate sympathy) cannot (see Riggle Reference Riggle2004 and Jardine and Heinz Reference Jardine, Heinz, Ershova, Falk, Geiger, Hebert, Lewis, Munoz, Phillips and Pillion2016b for additional examples of CDLs). Given how much progress in OT has been driven by proposed additions to CON, CDLs offer a valuable means of studying the computational consequences of such proposals.
3.3.2 Learnability
This section touches on how learning and learnability have been used to compare constraint-based theories. For more comprehensive surveys of these topics, readers are referred to Tesar (Reference Tesar and de Lacy2007), Heinz and Riggle (Reference Heinz, Riggle, van Oostendorp, Ewen, Hume and Rice2011), Albright and Hayes (Reference Albright, Hayes, Goldsmith, Riggle and Yu2014), Tessier (Reference Tessier2017), Jarosz (Reference Jarosz2019), and Heinz and Rawski (Reference Heinz, Rawski, Dresher and van der Hulst2022).
Complexity results of the sort discussed in the previous section have also been established for various learning problems. But Magri (Reference Magri2013a) argues that the relevance of intractability results (which are not unique to OT) is less about choosing among grammatical formalisms and more about their implications for child language acquisition. For example, the strong consistency problem (i.e., finding a grammar that is consistent with most of the data) in OT is intractable, because the cyclic ordering problem (shown by Galil and Megiddo Reference Galil and Megiddo1977 to be NP-complete) can be reduced to it. The weak version, in which the algorithm only needs to detect inconsistency without returning a grammar, is tractable (Tesar and Smolensky Reference Tesar and Smolensky2000), but as discussed previously (Sections 2 and 3.2.1), the phonological learner also needs to be concerned with the grammar’s restrictiveness on account of the subset problem. Magri shows that even when assuming consistent data, the subset problem in OT (i.e., minimize r-measure) is also intractable.
The implications of these results are as follows. Despite the common assumption that CON is innate and universal, constraint demotion algorithms assume an arbitrary constraint set and therefore only make use of the logic of ranking and optimization itself. That logic is sufficient for the weak consistency problem, but the subset problem demands more, and this is true for both batch learners and error-driven online ones. As the latter type are a better model of acquisition stages, Magri conjectures that their limitations with respect to finding restrictive grammars may be overcome by making use of the added structure that distinguishes linguistically plausible rankings (defined, e.g., in terms of particular feature interactions).Footnote 24
With respect to modeling acquisition, Magri (Reference Magri2012) also argues that both constraint demotion and promotion are desirable, if not necessary to capture acquisition trajectories (e.g., the use of different repair strategies over time). The GLA performs both of these operations, but it is not convergent in the general case (Pater Reference Pater2008). Magri (Reference Magri2012) shows that the issue is the balance of promotion and demotion; the latter is needed to guarantee convergence, and so the former cannot overwhelm its effects. A solution exists, however, in the form of calibrating the amount p that winner-preferring constraints are promoted such that it satisfies the formula in (28), where l is the number of un-dominated loser-preferring constraints and w is the number of winner-preferring constraints. Loser-preferring constraints are all demoted by a fixed amount.
(28)
Magri argues that the resulting proof of efficient convergence means the GLA’s lack of convergence cannot be used as evidence in favor of (MaxEnt-)HG – for which error-driven learners have also been proposed (Jäger Reference Jäger, Zaenen, Simpson, Holloway King, Grimshaw, Maling and Manning2007; Jesney and Tessier Reference Jesney, Tessier, Abdurrahman, Schardl and Walkow2009; Tessier Reference Tessier2009; Boersma and Pater Reference Boersma, Pater, McCarthy and Pater2016, but see also Magri Reference Magri2016) – though the desirability of promotion is itself an argument for numerical ranking, a requirement for the calibration solution. Magri (Reference Magri2013b) further shows that any instance of the OT ranking problem can be solved by converting it into an instance of the HG weighting problem, countering previous arguments that HG’s affinity for machine learning algorithms is evidence of its computational superiority over OT. As a demonstration, he shows how the GLA revised with the calibrated promotion amount can be reinterpreted as the perceptron algorithm used to find the weight vectors in HG.
Nonetheless, arguments about convergence and online learning are not the only consideration in the debate over weighted versus ranked constraints. The next section will turn to questions of expressivity, particularly with respect to the handling of gradience, variation, and exceptions.
3.3.3 Expressivity
Constraint-based phonotactic models have served as a proving ground for well-known phonological principles such as the sonority sequencing principle (see Daland et al. Reference Daland, Hayes, White, Garellek, Davis and Norrmann2011) and the obligatory contour principle (OCP) (Leben Reference Leben1973). The latter in particular has received a lot of attention in model comparison studies exploring different options for the source of gradient well-formedness. Frisch et al. (Reference Frisch, Pierrehumbert and Broe2004) propose that the co-occurrence restrictions on consonants in Arabic roots (Greenberg Reference Greenberg1950) are best explained with a gradient version of the OCP, for which the probability of a violation is a function of the consonants’ similarity. Gradient constraints reflect speakers’ knowledge of which patterns are over- and underrepresented in their lexicons, quantified as the ratio of the number of observed (O) consonant pairs to the number of pairs that would be expected (E) if consonants combined freely. In their analysis of Arabic, they show that O/E ratios decrease as similarity increases, a level of detail missed by categorical constraints.
They define the similarity of two consonants in terms of shared natural classes. The effect of subsidiary features (i.e., non-place features that influence the strength of the OCP–place effect) is then determined by the language’s inventory. But Coetzee and Pater (Reference Coetzee and Pater2008) argue that this is too restrictive and cannot account for the role of subsidiary features in the OCP effects in Muna (Austronesian). Their own approach, situated in HG, instead allows for the relevant subsidiary features to be identified for each place by providing constraints for all possible place/subsidiary feature combinations. Taking a different tack, Anttila (Reference Anttila2008) demonstrates that categorical ranked constraints can account for gradience by relating well-formedness to complexity, with complexity defined in terms of how many constraints need to be ranked below faithfulness for a form to surface faithfully. This leads to the complexity hypothesis, which states that O/E is inversely correlated with grammatical complexity: the more complex a structure is, the more underrepresented it will be. In addition to modeling gradience with a categorical grammar, this approach offers clear typological predictions for quantitative patterns: those that obtain regardless of how the constraints are ranked are predicted to be universal, with language-specific variation limited to complexity orderings that depend on the constraint ranking.
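The O/E ratio that figures in these studies can be sketched as follows. The fragment assumes a simplified setting in which each lexical item contributes one ordered consonant pair, and it takes the expected count to be the product of the two consonants' positional frequencies scaled by the number of pairs. The data are invented for illustration and are not drawn from Arabic or Muna.

```python
from collections import Counter

def observed_expected(pairs):
    """O/E for every (c1, c2) combination of attested first and second consonants."""
    n = len(pairs)
    first = Counter(c1 for c1, _ in pairs)    # frequency in first position
    second = Counter(c2 for _, c2 in pairs)   # frequency in second position
    observed = Counter(pairs)
    ratios = {}
    for c1 in first:
        for c2 in second:
            # Expected count if consonants combined freely.
            expected = first[c1] * second[c2] / n
            ratios[(c1, c2)] = observed[(c1, c2)] / expected
    return ratios

# Hypothetical toy lexicon of consonant pairs.
TOY_PAIRS = [
    ("b", "t"), ("b", "t"), ("b", "k"), ("t", "b"),
    ("t", "k"), ("k", "b"), ("k", "t"), ("m", "t"),
]
ratios = observed_expected(TOY_PAIRS)
```

A ratio below 1 indicates that a pair is underrepresented relative to free combination, and a ratio above 1 that it is overrepresented; in the toy data, the identical pair ("b", "b") is unattested and so has a ratio of 0.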
This cluster of studies demonstrates the use of O/E values as a test for how well a model’s predictions align with reality. However, Wilson and Obdeyn (Reference Wilson and Obdeyn2009) argue that O/E is not a statistically sound estimate of the OCP’s effects, because it does not account for the interaction of other constraints that might affect the observed frequency of a consonant pair (e.g., positional constraints for the individual members of the pair). Rather than attempting to isolate the effect of a single constraint, competing models should be assessed based on their fit to the data in its entirety. At the same time, just searching for the model that best fits the data can lead to the selection of a model that is more complex and therefore less restrictive as far as the limits it places on cross-linguistic variation. They advocate instead for prioritizing restrictiveness even at the cost of fit – for example, by using the Laplace approximation to combine the measures of data fit and model complexity.
They use this technique to compare MaxEnt grammars with the different versions of OCP–place discussed previously in this Element – namely Frisch et al.’s (Reference Frisch, Pierrehumbert and Broe2004) similarity and Coetzee and Pater’s (Reference Coetzee and Pater2008) acceptability – with their own proposed version that employs language-specific weighted features in which the similarity of two sounds is the sum of the weights of their shared features. Apparent differences in the effect of subsidiary features are explained by the different weights on place features, not as interactions of these feature types. If it appears that subsidiary features contribute differently across places, it is because their influence will be more prominent with low-weighted place features but masked with high-weighted ones. Their weighted features model outperforms the others on the same test cases from Arabic and Muna.
Another prominent argument in favor of weighted constraints is their ability to model the cumulative effects of violating multiple markedness constraints, in contrast to ranked constraint grammars in which only the highest-ranking decisive constraint violation matters (see Pater Reference Pater2009). Because of cumulativity, it is possible for multiple violations of a lower-weighted constraint to have a greater effect than a single violation of a more highly weighted one. It is also possible for single violations of multiple lower-weighted constraints to “gang up” and have more influence than a single more highly weighted constraint. In HG, the harmony of a candidate that violates multiple markedness constraints reflects additive (or linear) cumulativity compared to a candidate that violates only one of them. But this is not necessarily the desired result. In artificial grammar learning (AGL) experiments, Breiss (Reference Breiss2020), Durvasula and Liter (Reference Durvasula and Liter2020), and Breiss and Albright (Reference Breiss and Albright2022) all found evidence that cumulative markedness is greater than the sum of its parts: forms that violate multiple constraints appear to be subject to an added penalty over and above the contributions of the individual constraints.
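The two kinds of cumulativity just described can be illustrated with a small comparison of weighted and ranked evaluation. The constraint names, weights, and candidates below are hypothetical; the ranking is simply the order of the weights, so that the two grammars differ only in how violations are combined.

```python
# Hypothetical constraints: A outweighs (and outranks) B and C.
WEIGHTS = {"A": 3.0, "B": 2.0, "C": 2.0}
RANKING = ["A", "B", "C"]

def hg_penalty(viols):
    """HG: weighted sum of violations (lower is better)."""
    return sum(WEIGHTS[c] * viols.get(c, 0) for c in WEIGHTS)

def ot_profile(viols):
    """OT: violations compared constraint by constraint in rank order."""
    return tuple(viols.get(c, 0) for c in RANKING)

def hg_winner(cands):
    return min(cands, key=lambda name: hg_penalty(cands[name]))

def ot_winner(cands):
    return min(cands, key=lambda name: ot_profile(cands[name]))

# Gang-up: one violation each of B and C versus one violation of A.
gang = {"x": {"A": 1}, "y": {"B": 1, "C": 1}}
# Counting: two violations of B versus one violation of A.
counting = {"x": {"A": 1}, "z": {"B": 2}}

results = {
    "gang_hg": hg_winner(gang), "gang_ot": ot_winner(gang),
    "count_hg": hg_winner(counting), "count_ot": ot_winner(counting),
}
```

In both comparisons the ranked grammar rejects x because it violates the top-ranked constraint, whereas the weighted grammar selects x because the combined penalty of the lower-weighted violations (4.0) exceeds the single higher-weighted one (3.0). The penalties here simply add, which is the linear case described above.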
To test different theories’ ability to capture this effect of superlinearity, Smith and Pater (Reference Smith and Pater2020) compare the performance of stochastic OT, noisy HG, and MaxEnt grammars in modeling variable schwa deletion in French. The likelihood of a schwa deleting is conditioned by two contextual factors: whether the following syllable is stressed and whether the schwa follows one or two consonants. They compared the models’ fit to experimental data in which participants indicated whether they would pronounce a schwa in various phonological contexts and found that MaxEnt and noisy HG had the greatest success due to their ability to accommodate superlinearity. Breiss and Albright (Reference Breiss and Albright2022) also found that their experimental results were compatible with a MaxEnt grammar under certain weighting conditions, and further that the strength of the superlinearity effect of two constraints depends on the strength of their restrictions (i.e., how many exceptions are present in the training data).
How to handle lexical exceptions to otherwise productive patterns has itself been a prominent question of interest. It has been shown that stochastic OT’s ranking mechanism allows for a grammatical solution via dually listing morphologically complex forms that vacillate and singly listing those that do not (Zuraw Reference Zuraw2000; Hayes and Londe Reference Hayes and Londe2006). For non-derived forms, however, Moore-Cantwell and Pater (Reference Moore-Cantwell and Pater2016) propose a MaxEnt grammar that includes constraints indexed to particular lexical items. Shih and Inkelas (Reference Shih, Inkelas, Hansson, Farris-Trimble, McMullin and Pulleyblank2016) also call on MaxEnt to model lexically conditioned variation in Mende (Mande; Sierra Leone) tonotactics, in which the relative frequency of different tone melodies depends on part of speech. Their multilevel model includes a base grammar that predicts the overall distribution of tone patterns adjusted by a set of word class-specific weights for the same set of constraints. Similarly, Zymet (Reference Zymet2018, Reference Zymet, Hout, Mai, McCollum, Rose and Zaslansky2019) argues that as a single-level regression model, MaxEnt struggles to balance the contributions of grammatical and lexical constraints by treating them as equally likely sources of explanation for the statistical patterns in the data. In contrast, hierarchical regression’s nesting structure allows the grammar’s contribution to be prioritized by modeling general constraints as fixed effects and lexical constraints as random effects. The overall rate at which a generalization holds across the lexicon is captured as the fixed effect, with random effects accounting for lexically specific deviations from that rate.
As much of the work reviewed in this section makes clear, capturing non-categorical aspects of phonological knowledge has been an area of great interest. The next section focuses on one particular type of knowledge – phonotactics – and the various ways lexical statistics and other sources of information have been used to account for the observed gradience in acceptability judgments.
4 Gradient Acceptability and Lexical Statistics
It has long been recognized that speakers have knowledge of which sound sequences are and are not possible in their languages (i.e., phonotactic knowledge). A classic example is blick versus *bnick; neither is an actual word of English, but speakers reliably recognize that only the former is a possible word (Chomsky and Halle Reference Chomsky and Halle1965). Put another way, blick is treated as an accidental gap in the lexicon, whereas *bnick is prohibited from the lexicon due to some type of grammatical constraint. The observed contrast between pairs like blick/*bnick suggests a categorical model in which words are either allowed or disallowed based on their particular sequencing of sounds, and the presence of a single disallowed sequence is enough to condemn the entire word. Indeed blick and *bnick are nearly identical, so *bnick cannot be bad across the board. Rather, the problem with *bnick is isolated to a subword component, namely the sequence #bn.Footnote 25
Coleman (Reference Coleman1996) tested the psychological reality of such constraints by collecting acceptability judgments from English speakers for matched nonce words with and without a phonotactic violation (e.g., *mlisless versus glisless). The task was categorical (e.g., forced choice response to whether the word is or could be a word of English), and a word’s overall rating was calculated as the proportion of participants who accepted it. Contrary to the predictions of the categorical model, the results indicated that a single disallowed subword component does not in fact make the word completely unacceptable. In addition, words that did not violate any constraints but were made up of low-frequency components were rated lower than words with high-frequency components, despite both word types being fully grammatical.
This effect of frequency is taken as evidence that speakers are not just aware of what is and is not possible, but can draw on more detailed knowledge of distributional patterns to assess the gradient acceptability of actual and novel words. But this in turn raises the question of which lexical statistics best account for this knowledge (i.e., what frequencies are speakers attuned to). A baseline approach defines phonotactic probability with an n-gram model in which probabilities are assigned to sequences of length n based on their frequency in a training corpus. For example, in a bigram model (n = 2), the probability of the sequence bl is calculated as in (29),
(29) $P(l \mid b) = \dfrac{C(bl)}{C(b)}$
where C(bl) is the number of times bl appears in the corpus and C(b) is the number of times b appears (i.e., the unigram count of b).Footnote 26 The probability of an entire word is the product of the probabilities of its component bigrams. The difference in acceptability between blick and *bnick then would be accounted for if $P(l \mid b) > P(n \mid b)$ according to a corpus of English.
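A minimal sketch of this baseline model follows. It estimates bigram probabilities as the bigram count divided by the unigram count of the first symbol and scores a word as the product of its bigram probabilities. The corpus is a hypothetical handful of words, and orthographic letters stand in for segments purely for convenience; no smoothing or word-boundary symbols are used.

```python
from collections import Counter

def train_bigram(corpus):
    """Collect unigram and bigram counts from a list of words."""
    unigrams, bigrams = Counter(), Counter()
    for word in corpus:
        unigrams.update(word)                # counts each symbol
        bigrams.update(zip(word, word[1:]))  # counts adjacent pairs
    return unigrams, bigrams

def bigram_prob(x, y, unigrams, bigrams):
    """C(xy) / C(x), or 0.0 if x never occurs."""
    if unigrams[x] == 0:
        return 0.0
    return bigrams[(x, y)] / unigrams[x]

def word_prob(word, unigrams, bigrams):
    """Product of the probabilities of the word's component bigrams."""
    p = 1.0
    for x, y in zip(word, word[1:]):
        p *= bigram_prob(x, y, unigrams, bigrams)
    return p

# Hypothetical toy corpus (letters stand in for segments).
CORPUS = ["black", "blue", "blip", "brick", "abnormal", "nick", "lick"]
UNI, BI = train_bigram(CORPUS)

p_blick = word_prob("blick", UNI, BI)
p_bnick = word_prob("bnick", UNI, BI)
```

With this toy corpus the model assigns blick a higher probability than bnick, but only because bl happens to outnumber bn; the word-medial bn of abnormal still contributes to the count for bn.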
This baseline model, however, fails to capture the fact that bn as a sequence is not itself problematic (e.g., subnet, abnormal, abnegate). It is only a problem in a particular position, namely the beginning of the word. One way to address this flaw would be with a trigram model that includes the word boundary, in which case we would expect the word-initial sequence #bn to receive a lower probability than #bl.Footnote 27 But Vitevitch et al. (Reference Vitevitch, Luce, Charles-Luce and Kemmerer1997) go further to assess the role of position in phonotactic acceptability. They constructed nonce CVC syllables of high and low probability, with probability determined using both transition (i.e., bigram) probabilities and the probabilities of segments in particular positions (initial-medial-final). Bisyllabic forms were then created that varied in stress placement and covered all possible combinations of high- and low-probability syllables. Their task elicited gradience directly by asking participants to rate words on a scale from 1 (Good) to 10 (Bad). They found that stress placement as well as probability played a role in these judgments: words with first-syllable stress were rated higher than forms with second-syllable stress, and words with two high- (low-) probability syllables were rated highest (lowest) overall.
Similarly, the probabilistic parser of Coleman and Pierrehumbert (Reference Coleman, Pierrehumbert and Coleman1997) incorporates the role of prosodic structure via a (non-recursive) context-free grammar in which the non-terminals encode stress, syllable, and positional information. In the example rule in (30), S, O, and R represent syllable, onset, and rhyme, respectively, and the “sf” designation further indicates that these components are stressed and final.
(30) $S_{sf} \rightarrow O_{sf}\; R_{sf}$
Each word component is assigned a path based on the rules that generate it, and the probability of a path is computed from a parsed training corpus. The probability of an entire word is then the product of the probabilities of its component paths. While the probability of a word’s worst component was found to be significantly correlated with the experimental results of Coleman (Reference Coleman1996), the strongest predictor was the log probability of the entire word (with log probabilities being used to address the effect of word length). In this global and probabilistic approach to word acceptability, the presence of well-formed subword components can mitigate the effect of unattested or infrequent ones.
Another whole-word conception of likelihood is in terms of overall similarity to the existing words in the lexicon. Greenberg and Jenkins (Reference Greenberg and Jenkins1964) define this similarity in terms of the number of substitutions needed to convert a nonword into an actual word, and Ohala and Ohala (Reference Ohala, Ohala, Ohala and Jaeger1986) provide experimental evidence that English speakers are in fact sensitive to varying degrees of similarity among nonwords.Footnote 28 This conception of a word’s similarity has been extended to define its neighborhood density, or the number of existing words it is similar to. For example, blick is one substitution away from a number of existing words such as brick, black, block, slick, click, flick, and blip. How similar an existing word has to be to count as a “neighbor” can vary, but a common definition is that it differs by a single string edit operation, which includes additions and deletions as well as substitutions.
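The single-edit definition of a neighbor can be sketched as follows. The fragment assumes that each character of a transcription corresponds to exactly one segment, and the small lexicon of broad transcriptions is hypothetical and included only to make the example runnable.

```python
def one_edit_apart(a, b):
    """True if a and b differ by exactly one substitution, addition, or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Substitution: exactly one mismatching position.
        return sum(x != y for x, y in zip(a, b)) == 1
    if len(a) > len(b):
        a, b = b, a  # make a the shorter form
    # Addition/deletion: removing one segment from b yields a.
    return any(b[:i] + b[i + 1:] == a for i in range(len(b)))

def neighborhood_density(form, lexicon):
    """Number of lexicon entries that are one edit away from form."""
    return sum(one_edit_apart(form, word) for word in lexicon)

# Hypothetical mini-lexicon: brick, black, block, slick, click, flick,
# blip, lick, blink, cat.
LEXICON = ["brɪk", "blæk", "blɑk", "slɪk", "klɪk", "flɪk",
           "blɪp", "lɪk", "blɪŋk", "kæt"]

density = neighborhood_density("blɪk", LEXICON)
```

Here the nonword has seven substitution neighbors, one deletion neighbor (lick), and one addition neighbor (blink); cat does not count because it is more than one edit away.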
Of course, phonotactic probability and neighborhood density are correlated – forms with low- (high-) probability subword components will also have small (large) neighborhoods – making it difficult to tease apart which one is responsible for speakers’ acceptability ratings. The study of Bailey and Hahn (Reference Bailey and Hahn2001), however, aimed to do exactly that. Moving beyond the “single string edit” definition of a neighbor, they instead propose a more sophisticated generalized neighborhood model, in which edit operations are weighted to reflect the phonological distance between two forms (e.g., back differs from both pack and sack by one substitution, but the difference between /b/ and /p/ is smaller than the difference between /b/ and /s/ in terms of shared features). They found that both phonotactic probability and lexical neighborhoods have significant effects on acceptability, and that one is not subsumed by the other. However, given that less than half of the variance in their collected ratings was accounted for with these measures combined, they ultimately conclude that there is still more to understand about these factors in particular and gradient acceptability in general.
When teasing apart the relative influences of phonotactic probability and neighborhood density, Shademan (2006) argues for a need to consider task effects. In particular, the inclusion of actual words in the set of experimental stimuli (as in Bailey and Hahn 2001) may amplify lexical effects, as was previously suggested by Vitevitch and Luce (1998) in the context of word-processing tasks. Shademan’s own experiments tested words in the four logically possible combinations of high/low probability and high/low lexical similarity. When only nonce words were included, probability had a greater correlation with the acceptability ratings (scored from 1 to 7) than lexical similarity. When both actual and nonce words were included, the effect of lexical similarity became more pronounced, though it still was not as strong as the effect of probability.
Even with their relative roles still in question, phonotactic probability and neighborhood density do not tell the whole story of gradient acceptability. Other work has identified significant contributions of phonological knowledge beyond prosodic structure. Hay et al. (2004) tested nonce words containing nasal-obstruent clusters that vary in their frequency (e.g., [nt] is very frequent, [mθ] is unattested, and [nf] is attested but infrequent). Cluster frequency was highly correlated with participants’ acceptability ratings (which were determined on a scale of 1 to 10), at least for the attested clusters. The unattested clusters unexpectedly defied this pattern, being rated higher than some low-probability attested clusters. A follow-up experiment suggested that the additional information speakers might be incorporating into their assessment of nonce word likelihood includes morphological analysis (i.e., parsing unattested clusters as spanning a morpheme boundary) as well as long-distance phonological effects like the OCP.
Another potentially valuable source of information missed by basic n-gram models is the feature-based representations of segments, as in the Hayes and Wilson (2008) phonotactic model (discussed previously in Section 3.2.3). To further explore this potential, Albright (2009) uses the probability of sequences of natural classes in addition to specific segments to estimate the likelihood of an onset cluster that is not present in a training corpus. For example, both [bd] and [bn] are unattested onsets in English, which would be treated similarly by a bigram model. By generalizing over features instead, Albright’s model predicts the preference for [bn] because it has more features in common with attested onsets compared to [bd].
Yet another line of research has instead questioned the initial assumption that gradient acceptability ratings reflect gradient grammatical knowledge. In addition to the explanatory power of phonotactic probability and neighborhood density not being complete, Gorman (2013) shows that it is not even consistent, as in some cases these gradient models are outperformed by a baseline categorical model that only considers whether a word includes an illicit component. Like Shademan (2006), Gorman suggests the potential for task effects to play a role in gradient acceptability, including the implications of asking participants to use a scale in the first place (e.g., Armstrong et al. 1983 show that when given the option to use intermediate ratings, participants will provide gradient judgments on how odd or even a number is, even though these properties are categorical by definition).
Kahng and Durvasula (2023) also directly challenge the assumption that gradient acceptability reflects an underlying grammar of gradient generalizations, arguing that the perceptual system introduces bias and variance that can influence the cline of acceptability. In their experiment, Korean speakers rated forms with illicit clusters lower than those in which a vowel breaks up the cluster. More surprising, clusters with [c] as the first consonant were rated higher than those with [b] as the first consonant, even though both clusters are unattested. This discrepancy is hard to explain as a gradient generalization. A feature-based approach would actually predict [bC] to be better than [cC], since the former has more features in common with attested sequences (i.e., nasal clusters are attested, so the voicing of [b] should give it an advantage).
In addition to rating acceptability, participants were also asked to identify the medial vowel in the stimuli (with “no vowel” as an option). Factoring that information into the analysis, the authors found that (1) participants rated forms more acceptable when they heard an illusory vowel in the disallowed clusters, and (2) they were more likely to hear illusory vowels in [cC] compared to [bC]. Based on this, they propose a model in which categorical grammatical constraints operate over a probability distribution of perceptual representations. Hearing an illusory vowel in illicit forms results in a perceived licit form that the grammar recognizes as well formed, with the likelihood of hearing these vowels (i.e., the source of gradience) attributed to the perceptual system. They conclude with a larger suggestion that proposals for gradient grammatical generalizations based on acceptability judgments should be supported by an investigation into what participants actually perceived.Footnote 29
How best to account for phonotactic knowledge remains an ongoing question of interest. We will return to research on phonotactics – particularly how they are learned – in Section 7.1. But first, the next two sections will briefly survey the past and current phonological research employing information theory and neural networks, respectively.
5 Information Theory
This section demonstrates the utility and flexibility of information theoretic methods by highlighting examples of their application to a range of problems of phonological interest.Footnote 30
5.1 Features and Natural Classes
A great deal of work on phonological learning assumes – out of either principle or convenience – an innate feature set that can be used to define natural classes of sounds, but distributional data has also been used to induce those classes directly. The four-step method proposed by Mayer (2020) makes use of vector embeddings of sounds that represent the important aspects of their distribution, such as counts of all trigrams that include the target sound (normalized using positive pointwise mutual information, or PPMI). Phonological classes are then identified using principal component analysis (PCA) to reduce dimensionality, followed by k-means clustering over each principal component. Because the number of clusters is not known in advance, the Bayesian information criterion (BIC) (Schwarz 1978) is used to find the value of k that best balances model complexity (= number of clusters) and fit (= distance from the cluster centroids). Principal component analysis and clustering are performed recursively on discovered classes until the latter only identifies a single cluster.Footnote 31
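The PPMI weighting step can be illustrated with a minimal sketch. It is not Mayer's implementation: the function name `ppmi_embeddings`, the toy corpus, and the simplification of using only trigrams with the target sound in the middle position are all assumptions made for illustration, and the PCA and clustering steps are omitted.

```python
import math
from collections import Counter

def ppmi_embeddings(words):
    """Map each segment to a sparse vector of PPMI-weighted contexts.

    Simplifying assumption: a context is the (left, right) pair of
    neighbours around the target, with '#' marking word edges.
    """
    pair_counts = Counter()
    for w in words:
        padded = "#" + w + "#"
        for i in range(1, len(padded) - 1):
            pair_counts[(padded[i], (padded[i - 1], padded[i + 1]))] += 1

    total = sum(pair_counts.values())
    seg_counts, ctx_counts = Counter(), Counter()
    for (seg, ctx), n in pair_counts.items():
        seg_counts[seg] += n
        ctx_counts[ctx] += n

    vectors = {}
    for (seg, ctx), n in pair_counts.items():
        # PMI = log2( P(seg, ctx) / (P(seg) * P(ctx)) )
        pmi = math.log2(n * total / (seg_counts[seg] * ctx_counts[ctx]))
        if pmi > 0:  # "positive" PMI: keep only values above zero
            vectors.setdefault(seg, {})[ctx] = pmi
    return vectors

toy_corpus = ["pata", "tapa", "pita", "tipa", "tata", "papi"]  # hypothetical data
vectors = ppmi_embeddings(toy_corpus)
```

In this toy corpus the consonants end up with positive weights only on vowel-flanked or word-initial contexts, which is the kind of distributional signal that the later dimensionality reduction and clustering steps would pick up on.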
5.2 Allophones and Neutralization
Peperkamp et al. (2006) use the Kullback–Leibler divergence metric to compare the distributions of pairs of segments, taking a high value to indicate an allophonic relationship. Which of the pair is the allophone is determined using relative entropy, assuming the allophone will have higher entropy than the phoneme. To incorporate phonological knowledge, such as the fact that allophones tend to be similar to their phonemes, as well as the contexts they appear in, they add linguistic filters to weed out pairs of segments that happen to meet their selection criteria but lack these relationships. One clue is if a third segment exists that is intermediate between the two. Another is if the allophone is more distant from its context than the phoneme from which it is derived. The need for these filters means that distributional information alone is insufficient to detect true allophonic relationships; prior knowledge about which pairs of sounds might be allophones is also needed. Calamaro and Jarosz (2015) extend this model to learn cases of neutralization, in which two sounds that otherwise contrast are complementary in a particular context (i.e., have partially overlapping distributions).
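The distributional comparison at the core of this approach can be sketched as follows. This is an illustration only, not the authors' procedure: the symmetrized form of the divergence, the add-one smoothing, the restriction to following contexts, and the toy lexicon (in which a hypothetical segment T occurs only before [i] and t never does) are all assumptions.

```python
import math
from collections import Counter

def following_context_dist(words, segment, alphabet, smoothing=1.0):
    """Smoothed distribution over the symbol that follows `segment`."""
    counts = Counter()
    for w in words:
        padded = w + "#"  # '#' marks the end of the word
        for cur, nxt in zip(padded, padded[1:]):
            if cur == segment:
                counts[nxt] += 1
    total = sum(counts.values()) + smoothing * len(alphabet)
    return {c: (counts[c] + smoothing) / total for c in alphabet}

def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p) in bits; p and q must share the same support."""
    return sum(
        p[c] * math.log2(p[c] / q[c]) + q[c] * math.log2(q[c] / p[c])
        for c in p
    )

# Hypothetical lexicon: T appears only before i, t never before i.
words = ["Tika", "taki", "kuTi", "tuka", "kiku", "katu"]
alphabet = "aiuktT#"
dist = {s: following_context_dist(words, s, alphabet) for s in "tTk"}

kl_t_T = symmetric_kl(dist["t"], dist["T"])  # complementary pair
kl_t_k = symmetric_kl(dist["t"], dist["k"])  # contrasting pair
```

The complementary pair yields the larger divergence, which is the signal taken to indicate a candidate allophonic relationship; the linguistic filters described above would then be applied to such candidates.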
5.3 Phonotactics and Phonological Structure
Goldsmith and Riggle (2012) demonstrate the value of information theoretic methods for assessing the need for and contribution of phonological structure. The average positive log probability (i.e., entropy) of a dataset under a model tells us how surprising or accidental the data is according to that model. The difference in entropy between two models reveals the contribution (either positive or negative) of added structure. For example, a comparison of unigram (no structure) and bigram (linear structure) models can motivate something akin to bigram markedness constraints: absent a constraint against a sequence ab, the probability of that sequence is the product of the individual probabilities of a and b (i.e., these are independent events). But the probability of ab being greater or less than this product signals an interaction between them (i.e., they are more or less likely to occur together than independence would predict).
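A small sketch of this kind of model comparison is given here. It assumes maximum-likelihood estimates evaluated on the training data itself and a hypothetical toy corpus; the function names are illustrative and do not come from the original study.

```python
import math
from collections import Counter

def avg_plog_unigram(words):
    """Average positive log probability (bits per symbol), unigram model."""
    symbols = [s for w in words for s in w + "#"]
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(math.log2(counts[s] / total) for s in symbols) / total

def avg_plog_bigram(words):
    """Same quantity under a bigram model conditioned on the previous symbol."""
    bigrams = [(a, b) for w in words for a, b in zip("#" + w, w + "#")]
    pair_counts = Counter(bigrams)
    left_counts = Counter(a for a, _ in bigrams)
    return -sum(
        math.log2(pair_counts[ab] / left_counts[ab[0]]) for ab in bigrams
    ) / len(bigrams)

corpus = ["taka", "teke", "kata", "kete", "tata", "keke"]  # hypothetical data
uni = avg_plog_unigram(corpus)
bi = avg_plog_bigram(corpus)
gain = uni - bi  # bits per symbol saved by adding linear structure
```

A positive `gain` indicates that the bigram model finds the data less surprising than the unigram model does. Evaluating on the training data, as done here for brevity, always favors the richer model, so a real comparison would also need to account for model complexity.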
Using Finnish vowel harmony as an example, they show that a bigram model reduces entropy (i.e., assigns higher probability) compared to a unigram model, justifying its added structure. From there they conduct further comparisons with bigram models with more structure, such as simulating autosegmental tiers using bigrams over classes. Following Goldsmith and Xanthos’s (2009) procedure for discovering phonological categories, they find the partition of Finnish segments into two categories (each a probability distribution over its segments) that maximizes the probability of the data (computed with a two-state hidden Markov model). The resulting partition separates the inventory into vowels and consonants. Repeating this process on just the vowels discovers the categories of front and back that are relevant to Finnish’s vowel harmony patterns, with neutral vowels given close to equal emission probabilities from both states. With this model of the vowel tier, a word’s probability is the product of the probability of its sequence of vowels and the bigram probability of the original string, in which all vowels are collapsed to the symbol V. Because the resulting cost of this model is actually higher than the unaugmented bigram model, the authors appeal to a Boltzmann model to combine unigram, bigram, and vowel-to-vowel probabilities. By capturing nonlocal vowel-to-vowel dependencies as well as less commonly recognized local consonant–vowel effects, their model provides a better fit than either the unigram or bigram models alone.
The broader contribution of this work is the use of information theoretic model comparison methods to justify the increased complexity of added structure rather than assuming its inclusion as a matter of course.Footnote 32 These methods also provide the opportunity to verify assumptions that theories otherwise make for us by default – for example, that a language with vowel harmony necessitates a tier-based representation in which consonants do not contribute information. Goldsmith and Riggle’s (2012: 880) case study on Finnish reveals the “more complex linguistic reality” that even in the presence of vowel harmony, consonants may in fact condition the choice of a following vowel.
6 Neural Networks
The potential of connectionist models of morpho-phonology has long been recognized (e.g., Rumelhart and McClelland 1986; Gasser and Lee 1990), though perhaps underutilized, with current developments sparking renewed interest. Due to space limitations, this section will be necessarily brief, but readers are referred to Alderete and Tupper (2018) as well as any of the works cited in what follows for more context on and examples of the use of neural networks for phonological learning, modeling, and theorizing. Pater (2019) in particular provides an accessible introduction to neural networks and a valuable discussion of their history and potential for greater integration with generative linguistics research.Footnote 33 Taking an even stronger stance on the future, Boersma et al. (2020) argue that only a neural network model will be capable of accounting for the full range of behavioral data associated with the phonetics–phonology interface.
As articulated by Goldsmith (1992a), neural networks offer a means for conducting phonological analysis without committing to typical generative assumptions of a highly structured LAD, as well as a clearer route to drawing connections with other areas of cognitive science. As an example, Goldsmith (1992b) presents a case study on using a neural network to model stress patterns. The units of the network correspond to the bottom row of a metrical grid (i.e., syllables), with stress corresponding to local maxima (greater activation than neighboring units). Activation may be inherent (determined by syllable position or weight) or derived by lateral inhibition of a neighboring unit. Particular settings of inherent activation and the inhibitory weights simulate effects like alternating stress and avoidance of stress clash. These effects are achieved through iterative recomputation until equilibrium, rather than through an ordered derivation that manipulates hidden structure. Goldsmith and Larson (1993) further show how syllabification can be modeled using activation to encode a segment’s level of sonority, where a local maximum this time identifies a syllable’s nucleus. Language-specific constraints or parameters are the result of different weights that control the spread of activation through the network, as well as activation thresholds that determine what counts as a peak.Footnote 34
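The idea of deriving stress from lateral inhibition can be sketched in a few lines. The update rule used here (each unit's activation is its inherent activation plus weighted contributions from its right and left neighbors), the parameter names `alpha` and `beta`, and the particular values are assumptions based on the description above, not a reproduction of the original model.

```python
def settle(inherent, alpha, beta, tol=1e-9, max_iter=1000):
    """Recompute activations until they stop changing.

    Assumed update: a[i] <- inherent[i] + alpha * a[i+1] + beta * a[i-1],
    applied to all units at once; missing neighbours contribute 0.
    """
    act = list(inherent)
    n = len(act)
    for _ in range(max_iter):
        new = []
        for i in range(n):
            right = act[i + 1] if i + 1 < n else 0.0
            left = act[i - 1] if i > 0 else 0.0
            new.append(inherent[i] + alpha * right + beta * left)
        if max(abs(x - y) for x, y in zip(new, act)) < tol:
            return new
        act = new
    return act

def stressed_positions(act):
    """Indices of local maxima: units more active than both neighbours."""
    peaks = []
    for i, a in enumerate(act):
        left = act[i - 1] if i > 0 else float("-inf")
        right = act[i + 1] if i + 1 < len(act) else float("-inf")
        if a > left and a > right:
            peaks.append(i)
    return peaks

# Six syllables; only the first has inherent activation, and each unit
# is inhibited by the unit to its left (hypothetical parameter values).
equilibrium = settle([1.0, 0.0, 0.0, 0.0, 0.0, 0.0], alpha=0.0, beta=-0.5)
stresses = stressed_positions(equilibrium)
```

With these settings the inhibition propagates rightward with alternating sign, so the peaks fall on every other syllable starting from the first, an alternating pattern that arises without any ordered derivation.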
Given that the potential for these models to provide insight into language acquisition and cognition depends on their interpretability, several researchers have sought evidence of linguistic structure in the patterns of activation of hidden layers. For example, Alishahi et al. (2017) present results from experiments showing that phonemes can be recovered from the hidden layer representations of a recurrent neural network (RNN) trained to map pairs of spoken language and images into a semantic space. Success of phoneme recovery was greatest with the early (first and second) hidden layers and then decreased with each successive layer. Similarly, Smith et al. (2021) test a gestural harmony (Smith 2018) account of stepwise vowel harmony with an encoder–decoder model that maps strings of phonological units to sequences of articulator movements. The decoder’s patterns of attention indicated that it was attentive to states corresponding to harmony-triggering vowels throughout the span of the word, consistent with the gestural overlap account of vowel harmony.
A clear advantage of neural networks for modeling acquisition is their ability to work with raw speech data directly, rather than the discrete input representations assumed by the majority of phonological learning models. This potential integration of phonetic and phonological learning is explored with a generative adversarial network (GAN) in Beguš (2020a, 2020b). Generative adversarial networks consist of a generator network tasked with generating data and a discriminator network tasked with determining whether an input is real or generated. The two networks are trained in tandem such that the discriminator’s pattern of errors is used by the generator to improve its ability to generate data (i.e., its ability to fool the discriminator). Beguš (2020a, 2020b) used English data to train a GAN to generate voiceless stop-vowel sequences with and without an initial /s/. In the generated samples, the voice onset time (VOT) of the stop was significantly shorter in the presence of the initial /s/, as predicted by the English pattern of allophonic variation between aspirated and unaspirated voiceless stops. To again address the question of interpretability and better understand the actual learning mechanisms at work, Beguš proposes a logistic regression–based method for finding correlations between the model’s latent space and output variables such as presence versus absence of /s/.
In other work, neural networks are used to test the necessity of various types of linguistic structure for learning phonological patterns. For example, Doucette (2017) showed that phonotactic learning with an RNN is possible without relying on repeated features (i.e., alpha variables), and Mayer and Nelson’s (2020) RNN phonotactic learner performed comparably to Hayes and Wilson’s (2008) MaxEnt learner on Finnish vowel harmony, even without the augmentation of a tier. Prickett and Pater (2022) likewise forgo the need for prespecified constraints with an encoder–decoder model that achieved state-of-the-art accuracy on Tesar and Smolensky’s (2000) stress pattern dataset and correctly generalized 112 of its 124 patterns. Delving further into learning biases, Prickett (2019) shows that learning in a sequence-to-sequence model can simulate proposed biases for learning process interactions, namely maximal utilization (all rules apply maximally; Kiparsky 1968) and transparency (interactions are not opaque; Kiparsky 1971). And Prickett (2021) argues that such models also capture formal language theoretic complexity biases of the sort discussed in the next section.
In addition to addressing the challenges of interpretability, future research using neural networks will hopefully shed light on how the choice of architecture and other design options affect what they can learn, as well as how these choices relate to the kinds of grammatical distinctions that linguists make. One approach to understanding neural networks has been to draw on tools from formal language theory (see Merrill 2023 for an overview), which we turn to in the next section.
7 Formal Language Theory (FLT)
A formal language theoretic approach to phonology emphasizes the formal structure of linguistic patterning in order to identify abstract universal properties, which in practice often relate to computational complexity. This formal structure is recognized by first representing phonological patterns with mathematical objects. For example, a phonotactic constraint like *[−son, +voice]# (i.e., words cannot end with voiced obstruents) can be represented with a set of strings that do not violate it (e.g., {aba, ba, ap, pa, …}) or a function that maps a given string to 0 or 1 depending on whether it violates the constraint (e.g., f(aba) = 0, f(ab) = 1, etc.).Footnote 35 Similarly, a phonological rule can be represented with a function (obligatory rule) or relation (optional rule) that maps an input to an output or set of outputs, respectively (e.g., f(ab) = ap or {ab, ap}). The advantage of representing patterns in this way is the ability to identify their invariant structural properties (examples of which will be discussed in what follows), which hold regardless of the choice of grammatical formalism (rules, constraints, etc.).
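These three kinds of objects can be made concrete with a short sketch. The segment inventory and the function names are hypothetical; the toy set of voiced obstruents simply stands in for the class [−son, +voice].

```python
VOICED_OBSTRUENTS = set("bdgvz")  # toy stand-in for [-son, +voice]
DEVOICE = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s"}

def violates_final_voicing(word):
    """Constraint as a function: 1 if the word ends in a voiced obstruent, else 0."""
    return 1 if word and word[-1] in VOICED_OBSTRUENTS else 0

def final_devoicing(word):
    """Obligatory rule as a function: exactly one output per input."""
    if violates_final_voicing(word):
        return word[:-1] + DEVOICE[word[-1]]
    return word

def optional_final_devoicing(word):
    """Optional rule as a relation: the set of outputs paired with the input."""
    return {word, final_devoicing(word)}

# Constraint as a set: the strings from a sample that do not violate it.
sample = ["aba", "ba", "ap", "pa", "ab", "ad"]
well_formed = {w for w in sample if violates_final_voicing(w) == 0}
```

Nothing in these definitions depends on whether the pattern is stated as a rule or as a constraint; the set, function, and relation are the objects whose structural properties are under study.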
Importantly, this emphasis on structure is fully compatible with statistical and quantitative approaches. For example, Hayes and Wilson’s (2008) MaxEnt phonotactic learner (discussed in Section 3.2.3) makes use of a template for constraints in order to structure and narrow the hypothesis space. Formal language theory studies different kinds of templates, identifying what kinds of patterns they can and cannot express and what distinctions algorithms have to make to learn them. These results will be true regardless of whether or not the constraints are weighted or used to derive probability distributions.
Foundational work in this vein includes the aforementioned (Section 3.3.1) independent discovery by Johnson (1972) and Kaplan and Kay (1994) that SPE grammars are regular provided the rules do not reapply to their own structural changes.Footnote 36 This means an individual phonological rule can be compiled into an FST, and – since the regular relations are closed under composition – an ordered set of rules can likewise be represented with a single FST (i.e., the entire grammar is also regular). This finding established a computational boundary between phonology and other domains like morphology and syntax, which include patterns that are more complex than regular (see Carden 1983; Culy 1985; Shieber 1985; Kobele 2006; Heinz and Idsardi 2011, 2013). It also indicated that while SPE captures the basic intuition that phonological changes affect sounds in particular contexts, as a grammatical formalism, context-sensitive rules are more powerful than necessary for phonology.
The finite-state modeling of phonology has also been a productive route to the development of software for implementing morpho-phonological systems (Beesley and Karttunen 2003; Hulden 2009; Aksënova 2020; Gorman and Sproat 2021). Koskenniemi’s (1983) two-level rule approach in particular has served as the basis for morphological analyzers for several languages, including low-resource languages that are less amenable to deep-learning approaches (e.g., Çöltekin 2010 and Washington et al. 2012).
Regularity then provides a well-defined proposed computational universal that is sufficiently expressive while ruling out a great many non-phonological patterns. However, Heinz (2011a, 2011b) – the culmination of a line of work initiated in Heinz (2007, 2009, 2010b) – offers the further hypothesis that phonological patterns are in fact subregular and belong to proper subsets of the regular languages and relations. This hypothesis is motivated by (1) typology, as the regular classes still admit phonologically implausible patterns; and (2) learnability, as the regular classes are not learnable under a variety of settings, including in the limit from positive data (Gold 1967) and the probably approximately correct framework (Valiant 1984, 2013).
With respect to typology, the subregular hypothesis offers clear and testable predictions designed to advance our understanding of the nature of phonological computation. For example, Gainor et al. (2012) and Heinz and Lai (2013) show that progressive and regressive vowel harmony is not just regular but subsequential (i.e., deterministic), and dominant-recessive and stem-controlled harmony patterns are what they call weakly deterministic.Footnote 37 In contrast, unattested patterns like Sour Grapes (Padgett 1995; Wilson 2003) and Majority Rules (Lombardi 1999; Baković 2000) fall outside of these boundaries (the latter is in fact non-regular). In the same vein, Jardine (2016a) argues that tonal patterns regularly exhibit greater computational complexity compared to segmental ones.
Such hypotheses are informed by the current state of knowledge of what is and is not attested, and therefore serve to highlight what kinds of patterns we should be looking for in order to extend that state of knowledge. In response to Heinz and Lai (2013) and Jardine (2016a), McCollum et al. (2020) and Meinhardt et al. (2024) present vowel harmony patterns that meet Jardine’s (2016a) definition of unbounded circumambience (= the conditions for a change require unbounded lookahead in both directions from the target) and therefore serve as evidence that challenges his argument.Footnote 38 The latter work further draws a distinction between such patterns and those they call unbounded semiambient (= the conditions for a change require unbounded lookahead in at most one direction from the target).Footnote 39 These kinds of typological investigations thus result in valuable, nuanced characterizations of the computations necessary to recognize or represent a phonological pattern.
With respect to learnability, the fact that the regular languages and relations are not learnable from positive data means the property of regularity does not sufficiently limit the hypothesis space of a phonological learner. In contrast, the subregular properties that delimit proper subsets of regular do enable learning under these conditions. The next two sections review the work demonstrating that potential in the learning of phonotactics and mappings, respectively, much of which builds on computational learning theory (Osherson et al. 1986; Jain et al. 1999; Mohri et al. 2018) and grammatical inference (de la Higuera 2010; Heinz and Sempere 2016; Wieczorek 2017).
7.1 Phonotactic Learning
Any phonotactic learner must assume something about the hypothesis space of possible constraints that it navigates. As noted previously, FLT-based approaches to phonotactic learning prioritize the nature of those assumptions by characterizing the formal structure of the patterns themselves. For example, focusing on stress patterns, Heinz (2007, 2009) formalizes phonological locality with a property called neighborhood-distinctness, defined in automata-theoretic terms as not containing multiple states that share the same set of incoming and outgoing paths of designated lengths (or locality windows). A survey of 109 stress patterns compiled by Bailey (1995) and Gordon (2002) – now available in the StressTyp2 database (Goedemans et al. 2015) – revealed that all of the patterns have this property. Furthermore, roughly 75 percent of them are strictly local (SL) (Edlefsen et al. 2008; Rogers and Lambert 2019), which means they belong to a highly restrictive class of formal languages recognizable by devices that only track contiguous substrings of bounded length (McNaughton and Papert 1971; Rogers et al. 2010; Rogers and Pullum 2011). This length is often referred to as the language’s k-value, so, for example, an SL language with k=2 can be represented with a grammar of banned 2-length substrings (called k-factors, or, in this case, 2-factors).Footnote 40
As an example, consider a language that only allows syllables of the form CV, meaning all words in this language are of the form (CV)^n for some integer n. Abstracting away from individual consonant and vowel differences for simplicity, this language can be represented with the SL2 grammar in (31):
(31)
The reader can verify that all strings in (CV)^n are constructed from the 2-factors #C, CV, VC, and V#, none of which are in this grammar. Put another way, any string that violates this language’s phonotactic constraints will contain at least one of the prohibited 2-factors in (31).
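That verification can also be carried out mechanically. In the sketch below, the banned set is an assumption: it is reconstructed as the complement of the four permitted 2-factors named in the text over the symbols C, V, and the boundary #, and should not be read as a copy of (31). The function names are illustrative.

```python
def k_factors(word, k=2):
    """Contiguous substrings of length k, with '#' marking both word edges."""
    padded = "#" + word + "#"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

# Assumed banned 2-factors: every 2-factor over {C, V, #} with a segment
# in it that is not one of the permitted #C, CV, VC, V#.
BANNED = {"#V", "CC", "VV", "C#"}

def is_licit(word, banned=BANNED):
    """A word is well formed iff none of its 2-factors is banned."""
    return k_factors(word).isdisjoint(banned)

judgments = {w: is_licit(w) for w in ["CV", "CVCV", "CVC", "VCV", "CCV", "CVV"]}
```

Each ill-formed string is rejected because of a specific factor (for example, CVC contains C# and VCV contains #V), and the check never needs to inspect more than two adjacent symbols at a time.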
As proposed computational universals or at least strong tendencies, properties such as neighborhood-distinctness and strict locality serve to structure and restrict the hypothesis space a learner has to navigate, greatly reducing the number of generalizations it needs to consider. They have also served as the basis for provably correct learning algorithms that establish what kinds of patterns can be learned from data meeting certain criteria. Such proofs of correctness provide a guarantee that any pattern from any language that has the property in question can be learned, in contrast with simulation-based approaches in which success on particular languages and patterns must be assessed on a case-by-case basis.Footnote 41
Heinz (2010b) expands on this approach in an FLT analysis of long-distance phonotactic dependencies – such as consonant harmony – that apply across an arbitrary number of intervening segments. While not SL, both symmetric and asymmetric long-distance agreement patterns can be described with strictly piecewise (SP) or precedence grammars, provided they do not involve blocking. A version of the string extension learner proposed in Heinz (2010a) for SL languages is shown to learn SP patterns in the limit from positive data. Furthermore, if long-distance agreement with blocking is unattested, as suggested by the typological surveys of Hansson (2001) and Rose and Walker (2004), the SP characterization provides an explanation for that gap.
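The general shape of such a learner can be sketched as follows, under the assumption that the grammar is stored as the set of structures observed in the data (the permitted ones) rather than as a list of banned ones. The function names and the toy sample are hypothetical, and this is a simplification rather than the published algorithm.

```python
from itertools import combinations

def factors(word, k=2):
    """Contiguous k-length substrings, with '#' as the word boundary (SL)."""
    padded = "#" + word + "#"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def subsequences(word, k=2):
    """k-length subsequences, not necessarily contiguous (SP)."""
    return {"".join(c) for c in combinations(word, k)}

def learn(sample, extract):
    """Grammar = union of the structures found in the positive sample."""
    grammar = set()
    for word in sample:
        grammar |= extract(word)
    return grammar

def accepts(grammar, word, extract):
    """A word is accepted iff all of its structures have been observed."""
    return extract(word) <= grammar

# Hypothetical positive data from a language with sibilant harmony.
sample = ["sasa", "ʃaʃa"]
sp_grammar = learn(sample, subsequences)
```

The learned SP2 grammar accepts unseen harmonic words of any length and rejects disharmonic ones, because the subsequences sʃ and ʃs never occur in the sample. Swapping in `factors` for `subsequences` gives the corresponding SL2 learner.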
An SP grammar differs from an SL grammar in that it contains the banned subsequences, or precedence relations among segments (i.e., segment x cannot precede segment y in a string, with potentially other segments intervening between them). Blocking patterns are out of reach because they place an added condition on whether a given subsequence is permitted: segment x cannot precede segment y unless segment z intervenes. Consider a sibilant harmony pattern in which [s] and [ʃ] cannot co-occur in a word. The SP2 grammar for such a language is shown in (32). (Remember the items in this grammar are interpreted as subsequences, not contiguous factors.)
(32)
If, however, the agreement is blocked by [k] – for example, if sakaʃ is well formed but *sapaʃ is not – then we have a contradiction. The subsequence sʃ must be in the grammar to rule out *sapaʃ, but that grammar will then also necessarily (and incorrectly) reject sakaʃ.Footnote 42 Thus the inability of SP to handle long-distance agreement with blocking combined with the typological prediction that such patterns are not possible suggests that precedence is a useful characterization of this category of phonotactic patterns.
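The contradiction can be demonstrated directly. The banned set below is an assumption about the contents of the SP2 grammar for symmetric sibilant harmony (both orders of the two sibilants), and the function names are illustrative.

```python
from itertools import combinations

def subsequences(word, k=2):
    """All k-length subsequences of the word, contiguous or not."""
    return {"".join(c) for c in combinations(word, k)}

def sp_licit(word, banned):
    """A word is well formed iff it contains no banned subsequence."""
    return subsequences(word).isdisjoint(banned)

BANNED = {"sʃ", "ʃs"}  # assumed SP2 grammar for symmetric sibilant harmony

results = {w: sp_licit(w, BANNED) for w in ["sapas", "sapaʃ", "sakaʃ"]}
```

The grammar correctly rejects sapaʃ, but it rejects sakaʃ for exactly the same reason: the subsequence sʃ is present regardless of what intervenes. Removing sʃ from the banned set would rescue sakaʃ only at the cost of also admitting sapaʃ.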
However, subsequent work challenged that typological prediction with reported cases of long-distance agreement with blocking (e.g., Hansson 2010; Jurgec 2011). Based on such cases, McMullin (2016) argues that the tier-based strictly local (TSL) languages defined by Heinz et al. (2011) are a better characterization of long-distance patterns. Tier-based strictly local languages are defined with a subset of segments called the tier, over which SL constraints are defined. For example, sibilant harmony (without blocking) can be handled with a tier that includes only sibilants; the strings sapaʃ and sapas will be submitted to the grammar as just sʃ and ss, with non-tier segments removed. The SL2 grammar in (33) then suffices to rule out the former and accept the latter.
(33)
Blocking is handled simply by including the blocking segments on the tier. In the example in which [k] blocks the sibilant harmony, the strings *sapaʃ and sakaʃ are correctly distinguished, because the former (sʃ with non-tier segments removed) but not the latter (skʃ) includes the banned sequence sʃ. As for learning, TSL can be learned by the same algorithms that learn SL, provided the tier is already known, but algorithms also exist for learning both the grammar and the tier (Jardine and Heinz 2016a; Jardine and McMullin 2017).
Heinz’s (2010b) hypothesis that phonotactics are either SL or SP fit the assessment at the time that long-distance consonant agreement with blocking is unattested, and it also proposed an explanation for why that is the case (i.e., because of the way phonotactics are learned). The work on TSL that followed was motivated by a revision of that assessment, but equally important is what it revealed about the formal relationship between competing models of phonotactic grammars.Footnote 43 Precedence versus tiers as defined by SP and TSL are not just notational variants or competing ways of thinking about locality; they are distinct, formally defined properties that either do or do not hold of a given pattern.Footnote 44 Neither property subsumes the other, as each can describe patterns the other cannot.Footnote 45
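A minimal sketch of tier projection followed by an SL2 check shows the effect of adding the blocker to the tier. The banned set (both orders of the two sibilants) and the function names are assumptions for illustration.

```python
def tier_projection(word, tier):
    """Delete every segment that is not on the tier."""
    return "".join(seg for seg in word if seg in tier)

def tsl_licit(word, tier, banned, k=2):
    """Apply banned k-factors to the tier projection of the word."""
    padded = "#" + tier_projection(word, tier) + "#"
    found = {padded[i:i + k] for i in range(len(padded) - k + 1)}
    return found.isdisjoint(banned)

BANNED = {"sʃ", "ʃs"}           # assumed banned 2-factors on the tier
SIBILANTS = {"s", "ʃ"}          # tier for harmony without blocking
WITH_BLOCKER = {"s", "ʃ", "k"}  # tier for harmony blocked by [k]

no_blocking = {w: tsl_licit(w, SIBILANTS, BANNED) for w in ["sapas", "sapaʃ", "sakaʃ"]}
blocking = {w: tsl_licit(w, WITH_BLOCKER, BANNED) for w in ["sapas", "sapaʃ", "sakaʃ"]}
```

The banned factors are identical in both cases; only the tier changes. With [k] on the tier, sakaʃ projects to skʃ, in which the two sibilants are no longer adjacent, so the word is accepted while sapaʃ is still rejected.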
An example of a pattern that is TSL but not SP was already discussed: consonant harmony with blocking. For an example that is SP but not TSL, consider a language with the alphabet {a, b, c, d} with two constraints: “a” cannot precede “b”, and “b” cannot precede “c”. This language is straightforwardly SP2, as witnessed by the grammar in (34).
(34)
As a TSL2 language, this pattern requires a tier of {a, b, c} and the same grammar interpreted as factors instead of subsequences. However, the string acb, in which an “a” precedes a “b”, is incorrectly accepted, because its tier-string (also acb) contains neither of the banned factors.Footnote 46
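The contrast between the two interpretations of the same grammar can be checked with a short Python sketch. The banned pairs follow the two constraints stated above (“a” may not precede “b”, and “b” may not precede “c”); the function names and the brute-force search are illustrative assumptions rather than part of any published algorithm.

```python
from itertools import combinations, product

ALPHABET = "abcd"
BANNED = {("a", "b"), ("b", "c")}  # read as subsequences (SP2) or as tier factors (TSL2)


def sp2_accepts(word):
    """SP2: no banned pair may occur in order, whether adjacent or not."""
    return all(pair not in BANNED for pair in combinations(word, 2))


def tsl2_accepts(word, tier):
    """TSL2: no banned pair may be adjacent after projecting onto the tier."""
    projected = [seg for seg in word if seg in tier]
    return all(pair not in BANNED for pair in zip(projected, projected[1:]))


def first_disagreement(tier, max_len=3):
    """Return the first string (up to max_len) on which the two grammars differ."""
    for n in range(max_len + 1):
        for word in product(ALPHABET, repeat=n):
            if sp2_accepts(word) != tsl2_accepts(word, tier):
                return "".join(word)
    return None


print(sp2_accepts("acb"), tsl2_accepts("acb", set("abc")))  # False True
print(first_disagreement(set("abc")))                       # acb
```

Running `first_disagreement` over all sixteen subsets of the alphabet finds a mismatch for every candidate tier. This holds only for this fixed set of banned factors, so it illustrates the problem rather than proving that no TSL2 grammar exists.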
Formal language theory approaches to phonotactic learning traditionally forgo the use of statistics in favor of grammatical inference techniques that capitalize on the assumed structure of the hypothesis space. But Wilson and Gallagher (2018) argue that without statistics a feature-based model will be unable to determine which of the many possible feature representations are at the right level of specificity for the constraint it is trying to learn. For example, a language that enforces intervocalic voicing will allow sequences like [igi], [aba], and [ede], but will disallow *[iki], *[apa], and *[ete]. Some featural representations distinguish these two groups, such as , but others do not, such as , or , etcetera. Without statistics to assess the accuracy of these competing constraints, the learner will not be able to converge on the correct one.
In response, Chandlee et al. (2019) and Rawski (2021) present a structural inference approach to this problem that exploits the inherent structure of the space of possible feature representations. In particular, substructures such as k-factors form a partial order: [+nasal] is a substructure of [+nasal, +voice], which is in turn a substructure of [+nasal, +voice, +labial], and so on. This inherent ordering among candidate constraints establishes the following grammatical entailments: if a structure is grammatical, so must be all of the structures it contains (i.e., sit below it in the order), and likewise if a structure is ungrammatical, so must be all of the structures that contain it (i.e., sit above it in the order).
The proposed algorithm (the bottom-up factor inference algorithm, or BUFIA) takes advantage of these entailments to greatly reduce the number of constraints it has to consider. As the name indicates, it proceeds bottom up through the order to first consider the “smallest” or most general constraints. If a structure is observed in the input data, then there cannot be a constraint against it, and so the algorithm moves on to the next structures in the order (e.g., if any [+nasal] segment exists, then *[+nasal] is rejected but , etc. are still in the mix). In contrast, if no [+nasal] segment is found, then *[+nasal] can be kept as a constraint, and importantly, no constraint with [+nasal] as a substructure needs to be considered.
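The bottom-up search and its pruning step can be sketched in Python under strong simplifying assumptions. The sketch below is not BUFIA itself: it considers only feature bundles for single segments rather than k-factors, represents each bundle as a set of signed feature names, and uses a hypothetical toy inventory. The function name `bufia_sketch` and the `max_size` bound are likewise assumptions made for illustration.

```python
from itertools import combinations


def bufia_sketch(observed, features, max_size=2):
    """Return the most general unattested feature bundles, smallest first."""
    values = [sign + feat for feat in features for sign in "+-"]
    constraints = []
    for size in range(1, max_size + 1):
        for combo in combinations(values, size):
            bundle = frozenset(combo)
            if len({v[1:] for v in bundle}) < size:
                continue  # contradictory bundle such as {+nasal, -nasal}
            if any(known <= bundle for known in constraints):
                continue  # entailed by a more general constraint already found
            if not any(bundle <= segment for segment in observed):
                constraints.append(bundle)  # unattested: keep as a constraint
    return constraints


# Hypothetical toy inventory (roughly m, b, p), with no lateral segments.
observed = [
    frozenset({"+nasal", "+voice", "-lateral"}),
    frozenset({"-nasal", "+voice", "-lateral"}),
    frozenset({"-nasal", "-voice", "-lateral"}),
]

for constraint in bufia_sketch(observed, ["nasal", "voice", "lateral"]):
    print("*[" + ", ".join(sorted(constraint)) + "]")
# prints *[+lateral], then *[+nasal, -voice]
```

Because *[+lateral] is found at the first level, no larger bundle containing [+lateral] is ever tested, which is the source of the savings described above.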
To address the redundancy in the set of constraints identified by this process (i.e., all constraints that are maximally general and equally describe the data will be returned), Rawski (2021) proposes additional abductive principles to prune the search space of constraints, including a requirement that an added constraint must rule out at least one new structure, or the stronger requirement that it must rule out an entirely new set of structures. Such principles share the goals of selection criteria like Hayes and Wilson’s (2008) accuracy and generality or Wilson and Gallagher’s (2018) measure of gain, but being situated in a deterministic learner they always find the same set of constraints.
7.2 Learning Input–Output Maps
In addition to the research on phonotactics, a parallel line of work in FLT has focused on the characterization and learning of input–output maps. As noted previously (Section 3.2.2), maps are extensional representations of phonological processes, whose properties hold regardless of the grammatical formalism that is chosen to encode them intensionally. This distinction between intensions and extensions is central to the phonological research grounded in FLT. The finding that phonological grammars are regular was first based on an assumption that those grammars consist of a set of rules. Its further exploration in the context of constraint-based grammars (Section 3.3.1) was predicated on the idea that regularity should be preserved even without rules. Likewise, Tesar’s (2014) ODL learner (Section 3.2.2) capitalized on a property of maps: OT grammars are not necessarily output driven, but those that are (i.e., those that generate an output-driven map) have learnability advantages. This section presents work that has similarly capitalized on the learnability advantages of subregular properties of maps.
The learning of maps, then, is distinct from the learning of rules: no particular rule formalism is assumed, nor does one play any role in the learning algorithm. While test cases often refer to individual generalizations (e.g., final devoicing, intervocalic voicing, nasal assimilation, etc.) that one might represent intensionally with a rule, the target of the learner is still a map that could be generated by any number of grammatical devices. Furthermore, a single map can reflect the generalizations of multiple rules, even interacting ones (see Chandlee et al. 2018; Chandlee 2022). The learners surveyed in what follows in fact target classes of maps, and each will succeed on any map in its target class, regardless of how many rules that map represents.
The learnability advantage of subregularity for learning maps echoes the previous discussion of learning phonotactics: like the regular languages, the class of regular relations is insufficiently structured to guarantee learning from positive data. Functional counterparts to subregular languages have therefore been employed to address this problem. For example, the SL languages correspond to local functions (Berstel 1982; Vaysse 1986; Lind and Marcus 1995; see also Sakarovitch 2009), which compute the output string for a given input string based only on an examination of contiguous substrings (k-factors again) of bounded length. These functions are thus Markovian in that they can only make use of the most recent substring when deciding what to output next (with the degree of recentness determined by the size of k). The local functions have been further distinguished into two classes that differ in whether the examined substring is in the input or output string (Chandlee 2014), namely the input strictly local (ISL) and output strictly local (OSL) functions.Footnote 47 And just as the TSL languages augment the SL languages with the concept of a tier, tier-based counterparts to ISL and OSL have also been defined to model long-distance processes (Hao and Bowers 2019; Burness et al. 2021; Burness 2022).
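As a rough illustration of the input-based notion of locality, the following Python sketch computes each output segment from a window consisting of the current input segment and the k−1 input segments before it. The example process (post-nasal voicing with k = 2), the segment inventories, and the function names are assumptions chosen for simplicity. The sketch also omits the delayed output that ISL functions in general may use to condition on following material.

```python
VOICED = {"p": "b", "t": "d", "k": "g"}  # assumed voiceless/voiced stop pairs
NASALS = {"m", "n", "ŋ"}


def isl_map(word, k, rule):
    """Map each input segment using only a window of k input segments."""
    padded = "#" * (k - 1) + word  # '#' pads the left edge of the word
    output = []
    for i in range(len(word)):
        window = padded[i:i + k]   # k-1 preceding input segments plus the current one
        output.append(rule(window))
    return "".join(output)


def postnasal_voicing(window):
    """Voice a stop when the immediately preceding INPUT segment is a nasal."""
    previous, current = window[-2], window[-1]
    return VOICED.get(current, current) if previous in NASALS else current


print(isl_map("anpa", 2, postnasal_voicing))   # anba
print(isl_map("ampta", 2, postnasal_voicing))  # ambta
```

In the second example the /t/ surfaces unchanged because the window inspects the input (/p/), not the output ([b]); an output-based (OSL) function would instead consult the segments already written.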
Algorithms exist for all of these classes that can identify any function in the target k-local class (Chandlee et al. 2014; Chandlee et al. 2015; Burness and McMullin 2019), or even any class that can be represented with a deterministic transducer (Jardine et al. 2014). Recent work has also tackled the problem of learning both the phonological map and the lexicon by decomposing the function that maps meanings to SRs into a meaning → UR lexicon function and the UR → SR phonological function (Hua et al. 2021; Hua and Jardine 2021). Again the learner capitalizes on the assumption that the phonological function is subregular – specifically k-ISL – to converge on its target grammar.
The algorithms proposed to establish the formal learnability of these classes serve to demonstrate how subregular properties structure the hypothesis space of functions in a way that enables learning from positive data. The proofs of learnability often take the form of first defining a characteristic or sufficient sample and then showing how the algorithm, when given data with that sample as a subset, is guaranteed to converge on the target function. But the often unrealistic nature of characteristic samples – including impossible sequences as well as not allowing for various sources of noise such as optionality, variation, and exceptions – means these algorithms are only the first step toward developing a viable phonological learning model.
For example, in Gildea and Jurafsky’s (1996) experiments with a learning algorithm for subsequential functions (the onward subsequential transducer inference algorithm, or OSTIA; Oncina et al. 1993), they found that it cannot learn the English flapping rule even when given nearly 100,000 (input, output) string pairs derived from the CMU Pronunciation Dictionary. The issue is not the amount of data – OSTIA in fact requires little data compared to statistical learning models – but the type of data. Specifically, it would need to see whether flapping applies to impossible strings, such as /ttt/. Gildea and Jurafsky’s solution is to augment the learner with three phonologically informed learning biases: faithfulness (underlying segments undergo minimal changes), community (segments in natural classes tend to pattern together), and the use of context to identify phonological changes.
More recent work has explored additional routes for overcoming data limitations, including the use of semi-determinism (Beros and de la Higuera 2016) for optionality (Heinz 2020), as well as methods for generalizing over features (Markowska and Heinz 2023) and for identifying categorical constraints in the presence of exceptions (Dai 2022; Wu and Heinz 2023).
7.3 Model Theoretic Phonology
Much of the work mentioned so far has been grounded in the finite-state formalism, but other developments have demonstrated the utility of model theoretic approaches. Graf (2009, 2010) argues that the use of model theory for theory comparison is more efficient than an empirical approach, because theories (and variants of theories) can be grouped together based on how powerful a logic is needed to implement their assumptions. From there, we can assess what classes of phenomena a particular implementation accommodates, rather than separately testing individual patterns in individual theories. This type of investigation also provides a criterion for identifying the crucial distinctions among theories (which can be obscured given their extensive surface differences): specifically, the ones that necessitate an increase in logical power.
In addition, because logical characterizations operate over graphs, and strings are just a particular type of graph, model theory offers a straightforward way to extend string-based definitions of properties like locality to other structures, including nonlinear representations like trees, metrical grids (Liberman 1975; Liberman and Prince 1977; Prince 1983; Halle and Vergnaud 1987; Idsardi 1988; Hayes 1995), feature geometry (Sagey 1986; Clements and Hume 1995), and even sign (Rawski 2017).Footnote 48 For example, Jardine (2016b) presents a formal and restrictive theory of tone pattern well-formedness by applying a logical characterization of SL (i.e., the conjunction of negative literals, or CNL; see Strother-Garcia et al. 2016) to autosegmental graphs. Importantly, the use of logic allows the definition of locality to remain fixed while the representation is changed, highlighting the ways in which representation can modulate both perceived and formal pattern complexity.
As for input–output maps, logical characterizations of subregular function classes have been explored in work inspired by Engelfriet and Hoogeboom’s (2001) finding that regular functions are equivalent to monadic second-order (MSO) graph interpretations (Enderton 1972).Footnote 49 The restrictions that define the different subregular function classes correspond to restrictions on the logic used for the interpretation. For example, Chandlee and Lindell (2016) show that as graph interpretations the ISL functions require only quantifier-free (QF) first-order (FO) logic. Chandlee and Jardine (2019, 2021) use this characterization of locality to define autosegmental input strictly local (AISL) functions for tone processes. An AISL function is a QF graph transduction over autosegmental graphs. They show how AISL enables a formal and nuanced investigation into the conditions under which ARs make a local analysis of long-distance tone processes possible, as some but not all are local over both strings and ARs, while others are local over only strings or only ARs.Footnote 50 Again, what it means to be local here is not impressionistic, but an exact criterion (namely, QF FO).
In addition, graph interpretation has also been a tool for assessing the significance of differences among alternative representations. For example, Strother-Garcia (2019) shows that three different syllable representations (i.e., trees, strings labeled with syllable positions, and strings with syllable boundaries marked) are all QF-bi-interpretable, meaning each can be converted into the other with a QF interpretation. Here the use of QF reflects the degree to which the differences between representations are meaningful. The fact that such a limited logic is sufficient for these conversions is taken as evidence that they are essentially notational equivalents. In the same vein, Oakden (2020) shows that Yip’s (1989) and Bao’s (1990) proposed tonal representations are also QF-bi-interpretable, Jardine et al. (2021) show that there is a FO-definable interpretation between constraints stated in Q-theory (Shih and Inkelas 2019) and those stated over ARs, and S. Nelson (2022) uses CNL and CPL (conjunction of positive literals) logics to establish the extensional equivalence of various feature systems.
Importantly, while the different characterizations of subregular languages and functions (i.e., finite-state versus logic; see also Lambert 2022 for an algebraic treatment) are based in distinct formalisms that have distinct advantages, they converge to define the exact same classes of objects.Footnote 51 Thinking about these objects in terms of these differing formalisms can only serve to deepen our understanding of the nature of the phonological patterns they represent. As Engelfriet and Hoogeboom (2001: 216) write, “It is always a pleasant surprise when two formalisms, introduced with different motivations, turn out to be equally powerful, as this indicates that the underlying concept is a natural one.”
As noted by Heinz (2018), FLT offers a way to study phonology while being as atheoretical as one can get. This means we can gather insights into what phonology is – including predictions for what it can and cannot do, comparisons among different categories of patterns, and formally grounded criteria for what the relevant categories actually are – without being constrained by the lens of any one theory or formalism. This in turn allows us to uncover the aspects of our theories that do and do not reflect those independently discovered properties of our shared object of study. The finite-state treatments of SPE and OT mentioned previously are an example of this type of reckoning. More recently, work by Lamont (2019, 2021, 2022) has explored how the ways in which constraint-based grammars over-generate depend on different types of markedness constraints (local/substrings versus global/subsequences) and different versions of optimization (OT versus HS).
8 Conclusion
The variety and volume of work covered in this Element are a testament to how prevalent quantitative and computational approaches to phonology have become, and that trend is likely to not only continue but grow in the years ahead. Researchers focusing on a wide range of puzzles and problems related to the acquisition and representation of phonological knowledge are more fully embracing the value if not the necessity of computational, mathematical, and/or statistical tools in their investigations. In turn, these methods are becoming an increasingly necessary component of the teaching of and training in phonology as a field of study. The kinds of analysis they enable have greatly augmented our capacity for identifying and characterizing phonological patterns and for studying what can be learned under what conditions, from both a formal and an empirical perspective. Lastly, these approaches have equipped phonologists with a range of options for implementing our theories, forcing us to make them more precise and enabling us to better assess what remains to be uncovered with respect to the phonological component of natural languages.
Robert Kennedy
University of California, Santa Barbara
Robert Kennedy is a Senior Lecturer in Linguistics at the University of California, Santa Barbara. His research has focused on segmental and rhythmic alternations in reduplicative phonology, with an emphasis on interactions among stress patterns, morphological structure, and allomorphic phenomena, and socio-phonological variation within and across the vowel systems of varieties of English. His work has appeared in Linguistic Inquiry, Phonology, and American Speech. He is also the author of Phonology: A Coursebook (Cambridge University Press), an introductory textbook for students of phonology.
Patrycja Strycharczuk
University of Manchester
Patrycja Strycharczuk is Senior Lecturer in Linguistics and Quantitative Methods at the University of Manchester. Her research programme is centered on exploring the sound structure of language by using instrumental articulatory data. Her major research projects to date have examined the relationship between phonology and phonetics in the context of laryngeal processes, the morphology–phonetics interactions, and articulatory dynamics as a factor in sound change. The results of these investigations have appeared in journals such as Journal of Phonetics, Laboratory Phonology, and Journal of the Acoustical Society of America. She has received funding from the British Academy and the Arts and Humanities Research Council.
Editorial Board
Diana Archangeli, University of Arizona
Ricardo Bermúdez-Otero, University of Manchester
Jennifer Cole, Northwestern University
Silke Hamann, University of Amsterdam
About the Series
Cambridge Elements in Phonology is an innovative series that presents the growth and trajectory of phonology and its advancements in theory and methods, through an exploration of a wide range of topics, including classical problems in phonology, typological and areal phenomena, and interfaces and extensions of phonology to neighbouring disciplines.