The use of tonal coarticulation in segmentation of artificial language speech: A study with Mandarin listeners

Zhe-Chen Guo; Shu-Chen Ou

doi:10.1017/S0142716420000818

Published online by Cambridge University Press: 05 January 2021

Zhe-Chen Guo*: Affiliation:
University of Texas at Austin
Shu-Chen Ou: Affiliation:
National Sun Yat-sen University
*: *Corresponding author. E-mail: zcadamguo@utexas.edu.

Abstract

Tonal carryover assimilation, whereby a tone is assimilated to the preceding one, is conditioned by prosodic boundaries in a way suggesting that its presence may signal continuity or lack of a boundary. Its potential as a speech segmentation cue was investigated in two artificial language (AL) learning experiments. Mandarin-speaking listeners identified the “words” of a three-tone AL (e.g., [pé.tī.kù]) after listening to six long speech streams in which the words were repeated continuously without pauses. The first experiment revealed that segmentation was disrupted in an “incongruent-cues” condition where tonal carryover assimilation occurred across AL word boundaries and conflicted with statistical regularities in the speech streams. Segmentation was neither facilitated nor inhibited in a “congruent-cues” condition where tonal carryover assimilation occurred only within the AL words in 27% of the repetitions and never across word boundaries. A null effect was again found for the congruent-cues condition of the second experiment, where all AL word repetitions carried tonal carryover assimilation. These findings show that tonal carryover assimilation is exploited to resolve segmentation problems when cues conflict. Its null effect in the congruent-cues conditions might be linked to cue redundancy and suggests that it is weighted low in the segmentation cue hierarchy.

Keywords

artificial language learning; carryover assimilation; speech segmentation; tonal coarticulation

Type: Original Article
Information: Applied Psycholinguistics, Volume 42, Issue 3, May 2021, pp. 631–655

DOI: https://doi.org/10.1017/S0142716420000818
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright: © The Author(s), 2021. Published by Cambridge University Press

A crucial mental process during spoken language comprehension is the segmentation of continuous speech streams into discrete units such as words. Accumulated evidence has demonstrated that listeners exploit a wide variety of cues to facilitate this process (see, e.g., Cutler, 2012; Davis, Marslen-Wilson, & Gaskell, 2002, for reviews). These range from statistical regularities in the speech input to language-specific phonological patterns and acoustic-phonetic details. This study sets out to expand this line of investigation by experimentally testing whether tonal coarticulation, a low-level acoustic-phonetic phenomenon commonly observed in lexical tone languages, is used in speech segmentation. Before showing how tonal coarticulation could be useful, we present an overview of previous findings, as awareness of what cues have been shown to support segmentation is helpful for designing an experiment for our purpose.

Despite its continuous nature, speech contains statistical regularities that provide useful boundary information. These regularities are usually expressed in terms of transitional probability (TP), which captures the likelihood that a pair of elements co-occur. In general, two consecutive syllables with a higher TP tend to be word-internal and are more likely to be perceived as being so, whereas two syllables with a lower TP tend to occur across words and are more likely to be perceived as straddling a boundary (Mirman, Magnuson, Estes, & Dixon, 2008; Saffran, Newport, & Aslin, 1996; Saffran, Newport, Aslin, Tunick, & Barrueco, 1997). Such a segmentation solution is not exclusive to adult listeners, as young children and even infants are also able to compute TPs to extract possible word forms (Aslin, Saffran, & Newport, 1998; Estes, Evans, Alibali, & Saffran, 2007; Hay, Pelucchi, Estes, & Saffran, 2011; Saffran et al., 1997; Thiessen & Saffran, 2007). Tracking statistical regularities in speech is therefore thought to be an ontogenetically early segmentation strategy, permitting discovery of potentially meaningful units before the emergence of an adultlike lexicon.

With these statistical computations supporting segmentation from an early age, listeners further develop various language-specific segmentation solutions through increasing experience with native-language phonological patterns. For example, Dutch phonotactics prohibit word-internal [mr] sequences (e.g., *[mrɒk] is not a possible Dutch word), and Dutch listeners hearing these sequences would assume a word boundary between the two consonants (McQueen, 1998). Vowel harmony in Finnish dictates that word-internal vowels agree in frontness/backness (Karlsson, 1983), inclining Finnish listeners to segment speech in such a way that two syllables belong to a single word when their vowels agree in this feature but to different words when they do not (Suomi, McQueen, & Cutler, 1997; Vroomen, Tuomainen, & de Gelder, 1998). Phonological patterns that promote segmentation solutions may also arise from the distribution of a phonological entity, such as lexical stress. In English, the majority of words begin with stressed syllables (Cutler & Carter, 1987); thus, English listeners treat stressed or prominent syllables as word onsets and segment speech accordingly (Cutler, 1990; Cutler & Butterfield, 1992; Cutler & Norris, 1988; Tyler & Cutler, 2009).

In addition to phonological patterns, fine-grained acoustic–phonetic aspects of speech sounds may also help resolve segmentation problems. English listeners use subtle durational differences in [l] to disambiguate two lips and tulips (Gow & Gordon, 1995). Moreover, due to the prosodic structuring of speech, segments in the initial position of a larger prosodic constituent are produced with stronger articulatory strengthening than those in the initial position of a smaller prosodic constituent (Cho & Keating, 2001; Fougeron & Keating, 1997; Keating, Cho, Fougeron, & Hsu, 2004). The stronger strengthening effect associated with a larger constituent has been shown to facilitate the search for word onsets (Cho, McQueen, & Cox, 2007). Similarly, the degree of coarticulation between segments carries boundary information. Adjacent segments are more strongly coarticulated within words than between words (Byrd, 1996; Byrd & Saltzman, 1998), or across a smaller prosodic boundary than across a larger one (Cho, 2004; Fougeron & Keating, 1997). Listeners can use such coarticulatory information to segment speech (Fernandes, Kolinsky, & Ventura, 2010; Fernandes, Ventura, & Kolinsky, 2007).

Fine-grained acoustic–phonetic details may modulate segmentation behavior to such an extent that they affect or even determine the use of a segmentation strategy that is motivated by a language-specific phonological pattern. Supporting evidence comes from recent studies on the use of tonal or fundamental frequency (F0) cues by Korean and Taiwanese Southern Min (TSM) listeners. Korean is thought to have a prosodic constituent called the accentual phrase (AP), which frequently begins with a low (L) tone and ends in a high (H) tone (Jun, 1998), and Korean listeners tend to perceive H-L tone sequences as cueing an AP boundary (Kim, Broersma, & Cho, 2012; Kim & Cho, 2009). Moreover, Tremblay, Cho, Kim, and Shin (2019) suggest that the way in which the tone sequence is acoustic-phonetically realized affects how well it can be exploited for segmentation purposes. They found that Korean listeners’ segmentation improved as the L tone in the tonal sequence became phonetically lower and more closely resembled the canonical realization of the AP-initial L tone in Korean. A related case was reported in Ou and Guo’s (2019) study with TSM listeners. TSM is a tone language with an extensive tone sandhi process that restricts its only rising tone to the final position of the tone sandhi domain. It may thus be hypothesized that a cue as simple as a final rise in F0 suffices to signal finality for TSM listeners. Yet, given that the domain-final position is associated with phonetic final lengthening, it may be alternatively hypothesized that final lengthening is needed for a final F0 rise to be a sufficient finality cue. Ou and Guo’s findings support the alternative hypothesis: final F0 rise alone did not improve TSM listeners’ segmentation; instead, it was the combination of final F0 rise and final lengthening that did. Taken together, these two studies demonstrate that while some segmentation strategies are shaped by the distributions of phonological entities (e.g., the H-L tone sequence of the AP in Korean and the domain-final rising tone in TSM), they do not abstract away from the fine details of how those entities are acoustic-phonetically manifested.

To sum up, the literature has revealed at least three types of speech segmentation cues: (a) statistical regularities, (b) native-language phonological patterns, and (c) fine-grained acoustic–phonetic details. It has also been found that acoustic–phonetic details may impact the use of phonological patterns, suggesting that they play a nontrivial role in shaping segmentation behavior. Research on acoustic–phonetic cues could thus contribute insight into how sensitive and resourceful listeners are in solving segmentation problems, a question that may inform theories and models seeking to reveal what cues are useful and how they are integrated (e.g., Mattys, White, & Melhorn, 2005). Yet, empirical work to date on this type of cue is mostly concerned with cues at the segmental level (e.g., coarticulation of segments). Less attention has been paid to whether fine-grained tonal or F0 information is exploited. Perhaps the study that bears most on this question is that by Tremblay et al. (2019) discussed above. Nevertheless, while Tremblay et al. show that Korean listeners’ use of the H-L tone sequence was affected by the phonetic manifestation of the L tone, the segmentation strategy that the listeners employed is phonologically motivated in nature. Subtle acoustic–phonetic changes to the scaling of the L tone do result in better use of the segmentation strategy if they create a more canonical realization of the tone, but they are not what leads Korean listeners to develop the strategy in the first place. However, there are detailed acoustic–phonetic tonal phenomena in speech that do not seem to depend on a particular phonological entity and can potentially promote segmentation strategies on their own. One of them is tonal coarticulation. In this study, we focused on a specific type of tonal coarticulation and investigated its effect on speech segmentation. Below is an introduction to lexical tones and tonal coarticulation.

Tonal Coarticulation

Lexical tones are pitch patterns over a syllable that serve to differentiate word meanings. For example, in Mandarin, [ma] means “mother” when bearing a high-level tone (Tone 1) but “hemp” when bearing a rising tone (Tone 2). The primary acoustic correlate of lexical tones is F0, and the F0 realizations of one tone in connected speech may vary under the contextual influence of adjacent tones, resulting in so-called tonal coarticulation. Production experiments have revealed much evidence for tonal coarticulation in Mandarin and other lexical tone languages (see, e.g., Chen, 2012; Xu, 2001, for reviews). Recently, Hao, Zhang, Xie, and Zhang (2018) proposed a scheme for annotating tonal coarticulation and applied it to speech samples from a Mandarin corpus. About 51% of bitonal syllable sequences in their data were labeled as being tonally coarticulated, suggesting that tonal coarticulation is prevalent in connected speech, at least in Mandarin. The exact effect of one tone on another is commonly described in terms of (a) whether it is assimilatory or dissimilatory and (b) whether its direction is anticipatory or carryover (e.g., Brunelle, 2009; Chang & Hsieh, 2012; Cheng, 1968; Peng, 1997; Potisuk, Gandour, & Harper, 1997; Shen, 1990; Shih, 1988; Xu, 1994; Zhang & Liu, 2011). A logical corollary of this is that there are four theoretically possible types of tonal coarticulation: carryover assimilation, carryover dissimilation, anticipatory assimilation, and anticipatory dissimilation. As an example, the right panel of Figure 1 is a schematic illustration of carryover assimilation, whereby the F0 contour of a tone is partially assimilated to the preceding tone.

Figure 1. Possible transitions between a high-level tone and a rising tone (adapted from Xu, 1997, p. 63). The left panel represents a situation in which there is no tonal coarticulation. The right panel illustrates tonal carryover assimilation, whereby the initial portion of the rising tone’s F0 contour changes into a falling F0 transition due to assimilation to the preceding high-level tone.

Among the possible types of tonal coarticulation, carryover assimilation is of particular interest to the current investigation for two reasons. First, cross-linguistically, carryover effects are generally assimilatory, as evidenced by the fact that carryover assimilation is attested in a broad range of lexical tone languages, including Mandarin (Shih, 1988; Xu, 1997), Taiwanese (Cheng, 1968; Wang, 2002), Tianjin Chinese (Zhang & Liu, 2011), Thai (Gandour, Potisuk, & Dechongkit, 1994), Cantonese (Li, Lee, & Qian, 2004), and Vietnamese (Brunelle, 2009). In contrast, findings on whether anticipatory effects are assimilatory or dissimilatory are relatively mixed, as these effects are reported to vary across languages or even across different tones of the same language (Zhang & Liu, 2011). Second, and more importantly, it has been shown that, as with segmental coarticulation, tonal carryover assimilation is conditioned by prosodic boundary strength. In Mandarin, it tends to be stronger when two adjacent tones span the boundary of a smaller prosodic unit than when they span that of a larger unit (Lai & Kuang, 2016). Such a tonal coarticulatory effect may even be completely eliminated when the neighboring syllables straddle a major prosodic break (Zhang & Kawanami, 1999). These findings suggest that tonal carryover assimilation may be useful for segmenting continuous speech into discrete units.

The Current Study

The goal of the present work is to experimentally test this possibility. As with many empirical studies, we attempt to draw conclusions about a single cue, which, in our case, is tonal carryover assimilation. Nevertheless, cues can rarely be isolated from each other even in well-controlled laboratory speech materials. For example, while Fernandes et al.’s (2007) listeners exploited segmental coarticulation in segmenting an artificial language (AL), TP information was always present in the stimuli. It is therefore instructive to consider two possible scenarios for segmentation in the presence of multiple cues, as these considerations could inform the experimental design and the interpretation of the results. One scenario is that the cues operate in cooperation. A possible outcome is that their effects are additive or even synergistic, enhancing segmentation to a greater extent than a single cue alone does. An example is Fernandes et al. (2007), in which segmentation was better when segmental coarticulation and TP information were present and congruent with each other than when TPs were the sole cue. Yet, cooperating cues may be redundant and may not produce extra facilitation. Bagou and Frauenfelder (2018) and Kim et al. (2012) showed that although French and Korean listeners benefited from final lengthening and final F0 rise in isolation, conjoining these two prosodic cues did not improve their performance further. The other scenario is that the cues operate in conflict, and there are again two possible outcomes. One is that the conflict leads to inhibition. For example, Ordin, Polyanskaya, Laka, and Nespor (2017) found that Italian listeners used vowel lengthening to locate word-medial positions; therefore, when it was the vowels in the word-initial positions that were lengthened, TP-based segmentation was disrupted, reducing performance to a level worse than that of a condition with TPs as the only cue. Alternatively, the cue conflict may neither facilitate nor inhibit segmentation.

With these two scenarios in mind, we investigated the use of tonal carryover assimilation with the AL learning technique. It is an experimental paradigm that has been widely adopted to explore how phonological patterns and acoustic–phonetic details guide segmentation (e.g., Bagou & Frauenfelder, 2018; Fernandes et al., 2007; Kim et al., 2012; Ordin & Nespor, 2016; Ordin et al., 2017; Toro, Pons, Bion, & Sebastián-Gallés, 2011; Tremblay et al., 2019; Tyler & Cutler, 2009). A typical AL learning experiment has a learning (or exposure) phase followed by a test phase. In the learning phase, participants learn an AL by listening to long speech streams in which tokens of the “words” of the AL, which are meaningless syllable sequences, are concatenated without pauses in between. The basic cue for word segmentation is TP: as mentioned, adjacent syllables with a lower TP are more likely to span a boundary. On top of TPs, additional cues may be introduced to examine how they impact segmentation. After listening to the speech streams, participants complete a two-alternative forced-choice test in which they hear a word of the AL and a sequence that is not part of the AL vocabulary and have to select the former. The proportion of correct selections is thought to reflect how well the AL speech streams were segmented during learning.

Such an experiment has two important advantages for research concerned with phonological or acoustic–phonetic cues. First, as it has been suggested that segmentation is primarily lexically driven (Mattys & Bortfeld, 2017; Mattys et al., 2005), using nonsense speech prevents listeners from segmenting based on lexical knowledge (e.g., by lexical subtraction: White, Melhorn, & Mattys, 2010) and allows the researcher to obviate confounds such as word frequency. Second, with artificial speech, one can precisely control the acoustic–phonetic content of the additional non-TP cues. Two AL learning experiments were conducted in this study. Although their focus is on tonal carryover assimilation, as noted above, it is useful to also consider its possible effects in the presence of TP information, the basic segmentation cue in an AL learning task. We thus constructed conditions corresponding to the two scenarios discussed. Details about the design and the hypotheses to test are presented below.

Experiment 1

Participants

Ninety-six adult native speakers of Mandarin (28 males and 68 females) with no self-reported history of hearing impairments were recruited from a university in Southern Taiwan. Their mean age in years was 20.3 (range: 18–23; standard deviation: 1.4). They had been learning English as a compulsory subject in school, and 9 of them had received musical training (see Footnote 1). They were randomly assigned to one of the three experimental conditions (see the Design and stimuli section). The single-cue, congruent-cues, and incongruent-cues conditions had 31, 33, and 32 listeners, respectively.

Design and stimuli

In the learning phase, participants were exposed to a nonsense tonal language under three conditions, the design of which was modeled after Fernandes et al.’s (2007) study with segmental coarticulation. One was called the “single-cue” condition, in which participants could only rely on TPs to segment the AL speech streams. In addition to TPs, tonal carryover assimilation was introduced to the speech streams in the other two conditions. In the “congruent-cues” condition, tonal carryover assimilation occasionally occurred between the syllables within an AL word but never across word boundaries. Therefore, the tonal coarticulatory cue agreed with the TP cue. In the “incongruent-cues” condition, however, these cues were pitted against each other by letting tonal carryover assimilation occur across AL word boundaries. The congruent-cues and incongruent-cues conditions corresponded to the situation in which the cues are in harmony and the situation in which they are in conflict, respectively.

As in several studies (e.g., Kim et al., 2012; Ordin et al., 2017; Saffran et al., 1996; Vroomen et al., 1998), we created an AL consisting of six trisyllabic words, which were meaningless sequences of consonant-vowel syllables, as listed in the second column of Table 1. The words were formed from four vowels ([a, i, u, e]), three consonants ([p, t, k]), and three level tones (high-, mid-, and low-level tones). These consonants and vowels are cross-linguistically common and occur in Mandarin at least at the phonetic level. Irrespective of the tones, all the consonant–vowel syllables used except for [ki] are phonotactically possible in Mandarin. One constraint imposed during the construction of the AL lexicon was that adjacent syllables in a word had to differ by one tone level. For example, a high-level tone could only be preceded and followed by a mid-level tone, not by itself or by the low-level tone. Thus, there could be only six tone patterns, as shown in the first column of Table 1. In these patterns, neighboring tones always had different tone heights, creating what is referred to in the literature as “conflicting tonal contexts” (e.g., Peng, 1997; Xu, 1994) and allowing us to implement tonal carryover assimilation for every two adjacent tones in an AL word (and also a partword, described below). The fact that adjacent tones differed by one tone level also enabled us to control for the magnitude of carryover assimilation, so that, for instance, there was no carryover assimilation between the high-level and low-level tones, which would span a wider F0 range than that between the mid-level and low-level tones.
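
The count of six tone patterns follows directly from the one-level constraint, and it can be checked by brute force. The sketch below is only an illustration: the integer coding of tone heights and the function name are our own assumptions, not part of the original materials.

```python
from itertools import product

# Illustrative coding of tone heights (an assumption, not from the study):
# 0 = low-level, 1 = mid-level, 2 = high-level.
LEVELS = {0: "L", 1: "M", 2: "H"}

def allowed_patterns(n_syllables=3):
    """Return all tone sequences whose adjacent tones differ by exactly one level."""
    patterns = []
    for seq in product(LEVELS, repeat=n_syllables):
        if all(abs(a - b) == 1 for a, b in zip(seq, seq[1:])):
            patterns.append("-".join(LEVELS[t] for t in seq))
    return patterns

patterns = allowed_patterns()
print(len(patterns), patterns)
```

Every pattern that survives the filter has a mid-level tone either in the middle or on both edges, which is why the high-level and low-level tones are never adjacent.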

Table 1. Words of the artificial language (AL) and partwords

Note: The acute (´), macron (¯), and grave (`) marks represent high-level, mid-level, and low-level tones, respectively. The dots indicate syllable boundaries.

The syllables making up the AL words were individually inserted into a carrier sentence, read in a monotone by a phonetically trained male native speaker of Mandarin, and recorded with a Zoom H4n Handy Recorder. The recorded items were digitized at a sampling rate of 44.1 kHz and stored as a single WAV file. The syllables were excised from the carrier sentence and then underwent manipulations using Praat (Boersma & Weenink, 2018). Their root-mean-squared amplitudes were equalized and their durations were normalized to 335 ms, which was the mean duration of the original, unmanipulated syllables. Next, their F0 contours were flattened, set to 126 Hz (the average F0 of the syllables prior to the manipulations), and resynthesized using the overlap-add method in Praat. The resulting flat F0 contour served as the mid-level tone. Following Caldwell-Harris, Lancaster, Ladd, Dediu, and Christiansen (2015), we created the high-level and low-level tones by shifting the F0 contour up and down, respectively, by 3.5 semitones. The manipulated syllables were concatenated to form the six AL words.
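
To give a sense of the F0 values implied by a 3.5-semitone shift from 126 Hz, the following sketch applies the standard equal-tempered conversion (one semitone is a frequency ratio of 2^(1/12)). The resulting Hz values are not reported in the text and are derived here only from that formula; the function and constant names are hypothetical.

```python
def shift_semitones(f0_hz, semitones):
    """Shift a frequency by a number of semitones (12 semitones = one octave)."""
    return f0_hz * 2 ** (semitones / 12)

MID_F0_HZ = 126.0      # flat F0 of the mid-level tone, as stated in the text
SHIFT_ST = 3.5         # size of the upward and downward shift, as stated in the text

high_f0 = shift_semitones(MID_F0_HZ, SHIFT_ST)
low_f0 = shift_semitones(MID_F0_HZ, -SHIFT_ST)
print(f"high-level: {high_f0:.1f} Hz, low-level: {low_f0:.1f} Hz")
```

Under this conversion the high-level tone lies at roughly 154 Hz and the low-level tone at roughly 103 Hz, so the two shifts are symmetric on a logarithmic scale rather than in Hz.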

In addition, six “partwords,” which are listed in the third column of Table 1, were created using the same set of manipulated syllables. They served as the distractor stimuli in the test phase and were trisyllabic sequences derived by combining the last syllable of an AL word with the first two syllables of another AL word, or by combining the last two syllables of an AL word with the first syllable of another AL word. They were constructed under the same constraint for the AL words and therefore had the same six tone patterns. This prevented participants from being able to easily reject the partwords by identifying novel tone patterns.
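
The two ways of deriving a partword from a pair of adjacent words can be sketched as follows. The syllable labels are placeholders rather than the items in Table 1, and the function name is our own.

```python
def derive_partwords(word_a, word_b):
    """Return the two partword types that straddle the boundary between word_a and word_b."""
    # Last syllable of word_a followed by the first two syllables of word_b.
    one_plus_two = (word_a[2], word_b[0], word_b[1])
    # Last two syllables of word_a followed by the first syllable of word_b.
    two_plus_one = (word_a[1], word_a[2], word_b[0])
    return one_plus_two, two_plus_one

# Placeholder trisyllabic "words" (NOT the actual AL items).
word_a = ("A1", "A2", "A3")
word_b = ("B1", "B2", "B3")
candidates = derive_partwords(word_a, word_b)
print(candidates)
```

In the actual materials, a candidate would additionally have to satisfy the one-level tone constraint to be retained as a partword.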

The learning-phase stimuli were six speech streams in which tokens of the six AL words were concatenated with no pauses in between. Each stream contained a total of 120 tokens, 20 for each AL word. These tokens appeared in a random order but under the restriction that the same word did not occur twice in a row. As in Tyler and Cutler (2009), the first and last 5 s of each stream were faded in and out to prevent participants from hearing the syllables at the beginning and the end of the stream and using them to discover word boundaries. Each stream was 2 min long and the total duration of the six streams (and hence the learning phase) was about 12 min. The TP for each pair of adjacent syllables AB in the streams was calculated using the formula proposed in Saffran et al. (1996); that is, it is equal to the frequency of AB divided by the frequency of A. For each AL word or partword, an average TP was computed by taking the average of the TP between the first and second syllables and that between the second and third syllables. The average TPs for the AL words ranged between 0.75 and 1.00 (mean: 0.88) and those for the partwords ranged between 0.32 and 0.62 (mean: 0.49). The six speech streams differed from each other in the order in which the AL words appeared but were the same across the three conditions except for the presence or absence of tonal carryover assimilation and the way in which it was introduced. In the single-cue condition, there was no carryover assimilatory effect from one tone on the next tone and listeners could only achieve segmentation by tracking TPs.
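
The TP calculation described above can be illustrated with a toy stream. This is a minimal sketch: the three placeholder words and their order are invented for the example, and counting A only where it has a successor is our own choice for handling the stream-final syllable.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """TP(A -> B) = frequency of the pair AB divided by the frequency of A."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Assumption: A is counted only where a successor exists, so the
    # stream-final syllable does not inflate the denominator.
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

def average_tp(item, tps):
    """Mean of the TPs for syllables 1-2 and 2-3 of a trisyllabic item."""
    first = tps.get((item[0], item[1]), 0.0)
    second = tps.get((item[1], item[2]), 0.0)
    return (first + second) / 2

# Toy lexicon with placeholder syllables (NOT the AL of Table 1).
X, Y, Z = ("x1", "x2", "x3"), ("y1", "y2", "y3"), ("z1", "z2", "z3")
word_order = [X, Y, Z, X, Z, Y]            # no word occurs twice in a row
stream = [syl for word in word_order for syl in word]

tps = transitional_probabilities(stream)
word_tp = average_tp(X, tps)                       # within-word transitions only
partword_tp = average_tp(("x3", "y1", "y2"), tps)  # straddles a word boundary
print(word_tp, partword_tp)
```

Because each toy syllable belongs to only one word, the word reaches an average TP of 1.0, whereas the partword is pulled down by its boundary-spanning transition. In the actual AL, the reported word TPs fall below 1.00 for some items, which is expected when syllables are shared across words.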

In the congruent-cues and incongruent-cues conditions, tonal carryover assimilation was introduced as an additional cue. In each stream, a fixed number of trisyllabic sequences was selected to receive tonal carryover assimilation. In the incongruent-cues condition, they corresponded to all instances (i.e., 100%) of the partwords in the speech stream. However, the partwords occurred only incidentally and made up just about 27% of the syllables in the stream. To ensure that the incongruent-cues and congruent-cues conditions differed only in the alignment of the tonal coarticulatory cue with word boundaries but not in the number of syllables with the cue (as in Fernandes et al., 2007), only 27% of the tokens of the AL words in the congruent-cues condition were selected as the trisyllabic sequences that would receive tonal carryover assimilation. This tonal cue was implemented by performing the following F0 manipulations on the selected trisyllabic sequences. First, the F0 onsets of the second and third syllables were raised or lowered to the F0 offsets of their immediately preceding syllables (i.e., the first and second syllables, respectively). Second, between the shifted F0 onset and the 25% time point into the F0 contour of the second or third syllable, a smooth F0 transition was interpolated quadratically using Praat. The erstwhile flat F0 contour of a level tone then had an F0 rise over the initial quarter of its F0 contour if its preceding tone was lower and an F0 fall if its preceding tone was higher. Note that in actual Mandarin tone production, the assimilatory effect exerted by the preceding tone can be far more extensive: for example, it may still be evident even at the 75% time point of the next tone (e.g., Xu, 1997). Manipulating only the initial 25% of the F0 contour of a syllable allowed us to evaluate the influence of tonal carryover assimilation conservatively. Shown in Figure 2 are sample speech streams from each condition.
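
The shape of the manipulated contour can be approximated numerically. The sketch below is not the Praat procedure itself: it assumes one particular quadratic (a parabola that starts at the preceding tone's offset and flattens out at the 25% point), interpolates in Hz, and uses an arbitrary number of samples per syllable. The Hz values in the usage lines are likewise illustrative.

```python
def carryover_contour(prev_offset_hz, target_hz, n_points=100, proportion=0.25):
    """F0 samples for a level tone whose onset is assimilated to the preceding tone.

    Assumed shape: a parabola from prev_offset_hz at the onset to target_hz at
    the `proportion` time point, followed by a flat stretch at target_hz.
    """
    n_transition = int(n_points * proportion)
    contour = []
    for i in range(n_points):
        if i < n_transition:
            remaining = 1 - i / n_transition       # 1 at onset, 0 at the 25% point
            contour.append(target_hz + (prev_offset_hz - target_hz) * remaining ** 2)
        else:
            contour.append(target_hz)
    return contour

# Illustrative values: a higher tone after a lower one, and the reverse.
rising = carryover_contour(prev_offset_hz=126.0, target_hz=154.0)
falling = carryover_contour(prev_offset_hz=154.0, target_hz=126.0)
print(rising[0], rising[25], falling[0], falling[25])
```

With this shape, the onset equals the preceding tone's offset, and the contour reaches the level target exactly at the 25% point, giving an initial rise after a lower tone and an initial fall after a higher tone.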

Figure 2. Samples of learning-phase speech streams under the single-cue, congruent-cues, and incongruent-cues conditions. The dashed lines indicate word boundaries.

The test phase consisted of a two-alternative forced-choice test. In each trial, two stimuli (a word of the AL and a partword) were presented successively with 500 ms of silence in between. The stimuli did not have the tonal carryover assimilation cue and were the same for all conditions. Therefore, the conditions differed only in the learning phase, specifically, in whether and how carryover assimilation was introduced to the speech streams. The order in which the AL word and partword were presented in a trial was counterbalanced. There were 36 trials in total, yielded by pairing the six AL words exhaustively with the six partwords. E-prime 2.0 software (Psychology Software Tools, 2012) was used to control stimulus presentation and record responses.
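
The exhaustive pairing that yields 36 trials can be sketched as below. The text does not specify how presentation order was counterbalanced, so the alternation rule used here is only an assumption, as are the placeholder item labels and the function name.

```python
def build_trials(words, partwords):
    """Pair every word with every partword and assign a presentation order."""
    trials = []
    for wi, word in enumerate(words):
        for pi, partword in enumerate(partwords):
            # Assumed counterbalancing rule: each word and each partword
            # occupies the first position in half of its trials.
            word_first = (wi + pi) % 2 == 0
            first, second = (word, partword) if word_first else (partword, word)
            trials.append({
                "first": first,
                "second": second,
                "correct_button": 1 if word_first else 2,
            })
    return trials

# Placeholder labels (NOT the items in Table 1).
words = [f"word{i}" for i in range(1, 7)]
partwords = [f"partword{i}" for i in range(1, 7)]
trials = build_trials(words, partwords)
print(len(trials))
```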

Procedure

Participants were tested individually in front of a desktop computer in a sound-attenuated booth. They were told to learn an AL by listening to six prerecorded sound files of that language (i.e., the six learning-phase speech streams). They were not given any cues such as the length or number of the words in the AL. They were instructed to pay as much attention as possible to what they heard and made aware of an upcoming test that would assess their knowledge of the AL. They were allowed to take a short break after finishing listening to each sound file. After the learning phase, they immediately proceeded to the test, in which they heard two stimuli in a row in each trial. They were asked to select the one that they thought was a word of the AL by pressing the button on a response box that corresponded to the order of presentation of the word (i.e., button “1” or “2”). There was a 10-s response timeout after the second stimulus. Participants first completed three practice trials presenting nonsense syllable sequences not used in the AL to familiarize themselves with the procedure. They completed the practice by arbitrarily pressing any button on the response box but were reminded that they had to choose the AL words in the test proper.

Hypotheses and predictions

The single-cue condition served as the baseline for comparison with the other two conditions, based on which hypotheses regarding the effect of tonal carryover assimilation were tested. For the congruent-cues condition, the hypothesis was that tonal carryover assimilation in agreement with TPs contributes to segmentation above and beyond TPs. This predicted that listeners’ selection accuracy in the test would be significantly higher in the congruent-cues condition than in the single-cue one (as in Fernandes et al., 2007). As for the incongruent-cues condition, where the partwords received tonal carryover assimilation, the hypothesis of interest was that the conflict between the tonal cue and TPs would impede segmentation. Should this be the case, the listeners in the incongruent-cues condition would respond significantly less accurately than those in the single-cue one (much in the same way as the Italian listeners exposed to initial lengthening in Ordin et al., 2017). Findings lending support to the facilitation in the congruent-cues condition or the inhibition in the incongruent-cues condition may be interpreted as evidence for the use of tonal carryover assimilation.

Results and discussion of Experiment 1

The listeners’ responses in the test were analyzed. Timeouts (i.e., no responses within 10 s) accounted for about 0.43% of all observations and were excluded. In the remaining data, a response was coded as correct when the AL word in the trial was selected and as incorrect when the partword was selected. Figure 3 displays the mean percentage of correct responses in each condition along with the percentages of individual participants. A mixed-effects logistic regression model was fitted to the data by using the glmer() function from the lme4 package (Bates, Mächler, Bolker, & Walker, 2015) in R (R Core Team, 2017) to examine the effects of the experimental conditions. The dependent variable was the response to each test trial, which was either correct or incorrect. The fixed effect of central interest was condition, with the single-cue condition being the baseline level. Two additional predictors were also entered as fixed effects to partial out their impact on responses. First, the order in which a given trial appeared in the test (trial) was included to control for possible fatigue or practice effects. Second, as recommended in Ou and Guo (2019), the log-transformed reaction time (LogRT) was included to capture any potential trade-offs between response accuracy and latency. Both trial and LogRT were centered and scaled. All the fixed-effect predictors were entered as main effects only. Following the recommendation of Barr, Levy, Scheepers, and Tily (2013), we used the maximal converging random-effects structure supported by the data. For Experiment 1, the random-effects structure consisted of a by-participant random intercept and a by-item random intercept for partwords.
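The data preparation steps described above (excluding timeouts, coding accuracy, and centering and scaling trial order and log reaction time) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' analysis script: the field names are hypothetical, the natural logarithm is assumed because the base of the log transform is not stated, and scaling uses the sample standard deviation computed over the retained observations, which mirrors the default behavior of R's scale() but is likewise an assumption.

```python
import math
from statistics import mean, stdev


def zscore(values):
    """Center on the mean and divide by the sample (n - 1) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def preprocess(raw_trials):
    """Prepare trial-level records for modeling.

    Each record is a dict with hypothetical keys:
      trial         -- position of the trial in the test (1-36)
      rt_ms         -- reaction time in ms, or None if the 10-s timeout elapsed
      chosen        -- button pressed (1 or 2), or None on timeout
      word_position -- position (1 or 2) at which the AL word was presented
    """
    kept = [t for t in raw_trials if t["rt_ms"] is not None]  # exclude timeouts
    correct = [1 if t["chosen"] == t["word_position"] else 0 for t in kept]
    trial_z = zscore([t["trial"] for t in kept])
    log_rt_z = zscore([math.log(t["rt_ms"]) for t in kept])  # natural log assumed
    return [
        {"correct": c, "trial_z": tz, "log_rt_z": lz}
        for c, tz, lz in zip(correct, trial_z, log_rt_z)
    ]


# Made-up records for illustration only; these are not data from the experiment.
raw = [
    {"trial": 1, "rt_ms": 800.0, "chosen": 1, "word_position": 1},
    {"trial": 2, "rt_ms": None, "chosen": None, "word_position": 2},  # timeout
    {"trial": 3, "rt_ms": 1200.0, "chosen": 1, "word_position": 2},
    {"trial": 4, "rt_ms": 950.0, "chosen": 2, "word_position": 2},
]
rows = preprocess(raw)
print([r["correct"] for r in rows])  # [1, 0, 1]
```

The resulting rows, joined with participant, condition, and partword identifiers, would form the input to a mixed-effects logistic regression of the kind described above.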

Figure 3. Percentages of correct responses of individual participants (empty circles) and the means of the single-cue, congruent-cues, and incongruent-cues conditions (filled circles). The bars represent 95% confidence intervals.

Table 2 shows the results of the mixed-effects model. Trial was significant, with responses in later trials being less accurate than those in earlier trials, possibly due to fatigue. LogRT was significant as well, indicating an inverse correlation between response accuracy and latency (i.e., faster responses were more likely to be correct than slower ones). As suggested in Ou and Guo (2019), this correlation might be merely an artifact of response certainty: as the two stimuli were separated by 500 ms of silence, listeners might be ready to respond if they were sure that the first stimulus was an AL word or a partword. Most importantly, there was a significant main effect of condition, which indicated that response accuracy in the incongruent-cues condition (mean: 57.58%) was generally lower than that in the single-cue one (mean: 63.38%). Therefore, pitting the tonal carryover assimilation cue against TP information (by letting the cue-bearing sequences span word boundaries) hinders segmentation, reducing the listeners’ performance to a level below that of a condition where TPs are the sole cue. However, the other main-effect term of condition showed that response accuracy in the congruent-cues condition (mean: 65.39%), where tonal carryover assimilation occurred within word boundaries, was not significantly different from that of the single-cue one. There is thus no evidence that the listeners’ segmentation is better or worse when the TP and tonal carryover assimilation cues occur in tandem and in a cooperative manner than when TP information is the only cue.
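As a reading aid for these results, the model can be written out as below. The notation is assumed for illustration and does not come from the article: $i$ indexes participants, $t$ indexes test trials, $j(i,t)$ is the partword presented to participant $i$ on trial $t$, and the two condition terms are indicator variables under treatment coding with the single-cue condition as the reference level.

```latex
\operatorname{logit} P(\mathrm{correct}_{it} = 1)
  = \beta_0
  + \beta_1\,\mathrm{Congruent}_i
  + \beta_2\,\mathrm{Incongruent}_i
  + \beta_3\,\mathrm{Trial}_{it}
  + \beta_4\,\mathrm{LogRT}_{it}
  + u_i + w_{j(i,t)},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
w_j \sim \mathcal{N}(0, \sigma_w^2)
```

In this notation, the lower accuracy of the incongruent-cues condition corresponds to a negative $\beta_2$, the null result for the congruent-cues condition to a $\beta_1$ that is not reliably different from zero, and the trial and LogRT effects to negative $\beta_3$ and $\beta_4$.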

Table 2. Mixed-effects results of Experiment 1

Experiment 1 examined the use of tonal carryover assimilation by Mandarin listeners in speech segmentation with an AL learning task. One proposed hypothesis predicted that, relative to that of the single-cue condition, the response accuracy of the incongruent-cues condition would be significantly lower, and this prediction was borne out. The finding is consistent with Fernandes et al.’s (2007) study with segmental coarticulation, in which segmentation in the incongruent-cues condition was worse than in the single-cue one. Analogous results have also been reported by AL learning research demonstrating that, compared with a TP-only condition, segmentation is disrupted when an additional prosodic cue appears in a position that is unexpected in view of phonological patterns in the listeners’ native language (e.g., Ordin et al., 2017). The Mandarin listeners’ disrupted segmentation performance in the incongruent-cues condition of the present study may be interpreted as reflecting their attempts to use the tonal assimilation cue, even though such use is in conflict with TP regularities. Such disruption of segmentation would not be possible if the cue had not been exploited at all.

With regard to the congruent-cues condition, we hypothesized that the congruent tonal carryover assimilation cue would facilitate segmentation above and beyond the effects of TPs. The results did not support the hypothesis, as there was no significant difference in response accuracy between the single-cue and congruent-cues conditions. This is not consistent with Fernandes et al. (2007), who did find that with segmental coarticulation as the additional non-TP cue, segmentation under the congruent-cues condition was clearly better than under the single-cue one. Rather, it seems compatible with an alternative view: adding congruent tonal carryover assimilation would not facilitate segmentation because word boundaries are redundantly cued by the tonal and statistical information. Higher TPs between adjacent syllables within an AL word already signal that a boundary between them is unlikely, and there might be no need for conveying similar information via tonal coarticulation. Such a cue redundancy hypothesis may lead one to expect no significant difference in the listeners’ accuracy between the single-cue and congruent-cues conditions (as was the case with final lengthening and F0 rise in Bagou & Frauenfelder, 2018; Kim et al., 2012).

However, the lack of a significant effect of cue congruence could potentially be attributed to a confounding factor: cue reliability. Recall that only 27% of the AL word tokens in the congruent-cues condition carried tonal carryover assimilation. This percentage was rather low considering the percentage of bitonal sequences that showed a tonal coarticulatory effect (i.e., about 51%) as reported by Hao et al. (2018) for Mandarin corpus speech. Therefore, an alternative view was that our listeners were unable to benefit from tonal carryover assimilation under the congruent-cues condition simply because it was not reliably present, not because it was redundant with TP information. Specifically, as cue reliability is associated with cue weight (e.g., Mattys et al., 2005; Seidl, 2007; Tremblay, Spinelli, Coughlin, & Namjoshi, 2018), they might have allocated a relatively low weight to the unreliably present tonal carryover assimilation cue. This is not entirely impossible given that, however prevalent tonal coarticulation is in Mandarin, the listeners were instructed to learn a completely novel AL and might have developed a cue hierarchy for that AL, one in which tonal carryover assimilation was weighted so low that it failed to produce any extra gain in segmentation performance.

The main goal of Experiment 2 was therefore to test this cue reliability hypothesis, that is, to examine whether enhancing the reliability of the tonal carryover assimilation cue in the congruent-cues condition would facilitate segmentation, which would provide an alternative explanation for the null effect in Experiment 1. Currently, it is unclear how many cue-bearing tokens would count as sufficient for obtaining the kind of facilitation effects reported by Fernandes et al. (2007), who did not provide the exact percentage of segmentally coarticulated tokens in their congruent-cues condition. Yet, it would be insightful to test an experimental condition in which the reliability of the tonal carryover assimilation cue is maximized, namely, one in which all the tokens of the AL words receive the cue. Such a condition was included in Experiment 2. In addition, Experiment 2 included the same single-cue and incongruent-cues conditions from Experiment 1. This was done to ensure that the conditions being compared involved participants who had been randomly assigned to them and to examine whether the findings of Experiment 1 could be replicated.