Researchers estimate that more than half of the global population speaks more than one language (e.g., Bialystok, Craik & Luk, Reference Bialystok, Craik and Luk2012; Grosjean, Reference Grosjean2010; Romaine, Reference Romaine, Bhatia and Ritchie2012), and when learning starts at an early age, native or near-native proficiency in comprehension and production in both first and second languages is commonly achieved (Perani, Abutalebi, Paulesu, Brambati, Scifo, Cappa & Fazio, Reference Perani, Abutalebi, Paulesu, Brambati, Scifo, Cappa and Fazio2003). Proficient bilingual and multilingual speakers switch between languages in certain conversational contexts, with little apparent difficulty (Fricke & Kootstra, Reference Fricke and Kootstra2016; Toribio, Reference Toribio2004). Sometimes the use of each language is asymmetric between speakers. For instance, in conversations between older and younger members of immigrant families, the elders may prefer to speak their heritage language whereas the younger generation may speak the language of their new community (e.g., Park & Sarkar, Reference Park and Sarkar2007). Communicating in two different languages concurrently is a relatively normal mode of bilingual communication known as lingua receptiva (Bahtina-Jantsikene & Backus, Reference Bahtina-Jantsikene and Backus2016; Bahtina, ten Thije & Wijnen, Reference Bahtina, ten Thije and Wijnen2013; ten Thije, Gooskens, Daems, Cornips & Smits, Reference ten Thije, Gooskens, Daems, Cornips and Smits2017; ten Thije, Reference ten Thije2013).
The prevalence of lingua receptiva raises the question of whether principles of language interaction in monolingual speech may apply to bilingual interactions as well. One of the most well-established principles of monolingual interaction is speech alignment (Pickering & Garrod, Reference Pickering and Garrod2004), which generally refers to the matching of speech produced with speech perceived across multiple levels of representation. For instance, in a conversation between two people, one speaker may use the word “loafer” to refer to a shoe, and the other person may then use the same word “loafer” instead of, say, a “dress shoe” (Brennan & Clark, Reference Brennan and Clark1996). Convergence can also occur in the phonetic features of speech (Kim, Horton & Bradlow, Reference Kim, Horton and Bradlow2011; Pardo, Reference Pardo2006), syntactic structures (Bock, Reference Bock1986), and prosody (Abney, Paxton, Dale & Kello, Reference Abney, Paxton, Dale and Kello2014; De Looze, Scherer, Vaughan & Campbell, Reference De Looze, Scherer, Vaughan and Campbell2014; Xia, Levitan & Hirschberg, Reference Xia, Levitan and Hirschberg2014).
In the present study, we investigate speech convergence in open-ended conversations using two measures of matching that can be applied within one language, as well as across two different languages. Conversations were spoken in English, Spanish, or in a “Mixed” condition where one person spoke English and the other Spanish. The study of inter-language speech convergence in conversations is challenging because there may be no direct correspondences between the surface forms of linguistic units or features produced in two different languages, like English and Spanish. We use complexity matching (Abney, Kello & Warlaumont, Reference Abney, Kello and Warlaumont2015; Abney et al., Reference Abney, Paxton, Dale and Kello2014) as a recent measure of speech convergence that can be directly applied to acoustic speech signals without linguistic coding. It is therefore equally applicable to measuring convergence of speech signals in the same or different languages. We also use a measure of lexical matching (Brennan & Clark, Reference Brennan and Clark1996; Brennan, Kuhlen & Charoy, Reference Brennan, Kuhlen, Charoy, Wixted and Thompson-Schill2018; Garrod & Anderson, Reference Garrod and Anderson1987; Niederhoffer & Pennebaker, Reference Niederhoffer and Pennebaker2002) that can be applied to open-ended conversations without one-to-one alignment of words, although it does require some degree of translation for estimating semantic correspondences of lemmas used across languages.
Our goal was to measure matching within one language and compare it with matching across two languages, using an acoustic measure of the speech signals produced, compared with a linguistic measure of the underlying language interaction. Prior studies lead us to predict that both kinds of matching should occur in purely Spanish conversations and purely English conversations, provided that speakers are proficient enough with the language. It is an open question whether conversations across languages might exhibit weaker signs of matching due to translation, or whether matching is more basic to spoken interactions as a form of convergence during communicative coordination.
In the next section, we review prior studies of inter- and intra-language convergence, pointing out the need for measures spanning different levels of analysis that may be applied more generally to open-ended conversations. We then introduce and explain complexity matching and our measure of lexical matching, where the latter is based on overlap in the frequency distributions of lemmas underlying word tokens. Our experiment follows, and we end with a discussion of the implications of our results for theories of bilingualism, language interaction, and convergence.
Monolingual and bilingual speech convergence
Whether speaking in either inter- or intra-language settings, conversations are coordinated interactions where speakers variably lead, follow, and echo each other as they exchange ideas. Convergent coordination has been observed in multiple aspects of speech and language. In terms of sound, phonetic features, such as vowel quality and voice onset time, become more similar between conversational partners. Phonetic convergence has been observed in monolingual conversations (Pardo, Reference Pardo2013; Pardo, Reference Pardo2006; Pardo, Gibbons, Suppes & Krauss, Reference Pardo, Gibbons, Suppes and Krauss2012) and has been indirectly explored in at least one bilingual study as well (Sancier & Fowler, Reference Sancier and Fowler1997). Phonetic convergence in monolingual conversations was measured by Pardo (Reference Pardo2006) and Pardo et al. (Reference Pardo, Gibbons, Suppes and Krauss2012) by having listeners judge the similarity of speakers’ phoneme production. Using a more indirect method, Nielsen (Reference Nielsen2011) demonstrated that phonetic convergence depends on factors such as word frequency and voice onset time (VOT).
Building on Nielsen (Reference Nielsen2011), convergence in VOT has also been used to measure bilingual phonetic imitation (Balukas & Koops, Reference Balukas and Koops2015; Tobin, Nam & Fowler, Reference Tobin, Nam and Fowler2017). For example, Balukas and Koops (Reference Balukas and Koops2015) analyzed words spoken in conversational interviews by New Mexico Spanish–English bilinguals. The words analyzed contained an initial /p/, /t/, or /k/ sound in both Spanish and English. Participants speaking Spanish (the first language for most participants) were found to have VOT values within the normal range for monolingual Spanish. However, when speaking English, VOTs fell within the low range of monolingual English. Therefore, participants appeared to “bend” phonemes of their non-dominant language towards the dominant language near code-switching events.
At more grammatical levels of language processing, syntactic priming is a form of conversational coordination where speakers tend to produce (Bock, Reference Bock1986; Healey, Purver & Howes, Reference Healey, Purver and Howes2014) or comprehend (Branigan, Pickering, Liversedge, Stewart & Urbach, Reference Branigan, Pickering, Liversedge, Stewart and Urbach1995) new sentences using syntactic structures recently produced or heard. Syntactic priming is well-established in monolingual conversations (Hardy, Messenger & Maylor, Reference Hardy, Messenger and Maylor2017), and bilingual conversations as well because languages often share common syntactic constructions (Hartsuiker, Pickering & Veltkamp, Reference Hartsuiker, Pickering and Veltkamp2004), making it relatively easy to apply measures of syntactic priming in both monolingual and bilingual studies.
While researchers have found evidence for syntactic priming across languages, the language tasks used are often contrived and sometimes require confederates (Fleischer, Pickering & McLean, Reference Fleischer, Pickering and McLean2012; Hartsuiker et al., Reference Hartsuiker, Pickering and Veltkamp2004). For instance, Hartsuiker et al. (Reference Hartsuiker, Pickering and Veltkamp2004) had Spanish–English bilingual participants talk about cards in English with a bilingual confederate who spoke in Spanish. Participants who heard a passive sentence in Spanish were more likely to respond using a passive sentence in English. The authors concluded that these findings provide support for the integration of syntactic representations between the two languages. Kantola and van Gompel (Reference Kantola and van Gompel2011) also found syntactic priming in Swedish–English bilinguals, and recent corpus-based studies on bilingual syntactic priming have provided a more naturalistic source of evidence (Gries & Kootstra, Reference Gries and Kootstra2017).
Speech convergence has been theorized to be beneficial to language interactions by helping to establish common ground (Brennan & Clark, Reference Brennan and Clark1996), affiliation (Manson, Bryant, Gervais & Kline, Reference Manson, Bryant, Gervais and Kline2013; Pardo et al., Reference Pardo, Gibbons, Suppes and Krauss2012), or comprehension (Branigan et al., Reference Branigan, Pickering, Liversedge, Stewart and Urbach1995; Schober & Clark, Reference Schober and Clark1989). Convergence may stem from domain-general processes of imitation (De Looze, Oertel, Rauzy & Campbell, Reference De Looze, Oertel, Rauzy and Campbell2011; van Baaren, Holland, Steenaert & van Knippenberg, Reference van Baaren, Holland, Steenaert and van Knippenberg2003), possibly implemented through links between speech and language perception and production (Buchsbaum, Gregory & Colin, Reference Buchsbaum, Gregory and Colin2001; Tian & Poeppel, Reference Tian and Poeppel2012), or on-the-fly processes and representations that arise to support coordination and understanding (Brennan & Hanna, Reference Brennan and Hanna2009).
Findings of convergence in bilingual speakers suggest that similar mechanisms of matching are at play in both bilingual and monolingual conversations (Fricke & Kootstra, Reference Fricke and Kootstra2016; Kroll, Dussias, Bogulski & Kroff, Reference Kroll, Dussias, Bogulski and Kroff2012), which may or may not be symmetric between two given languages spoken (Blumenfeld & Marian, Reference Blumenfeld and Marian2007). The rationale for similar mechanisms can be explained in terms of the theory of interactive alignment (Garrod & Pickering, Reference Garrod and Pickering2004; Pickering & Garrod, Reference Pickering and Garrod2004) in which convergence stems from the alignment of language representations across multiple levels of processing that interact with each other. To the extent that convergence is found across two languages, interactive alignment entails that representations from different languages must be aligned. Representational structures may have direct correspondences across languages in some cases, such as shared discourse processes stemming from one culture with multiple languages. In other cases, alignment may require some degree of translation, as between Spanish and English lexicons. Costa, Pickering, and Sorace (Reference Costa, Pickering and Sorace2008) note that convergence may be weaker to the extent that speakers are less proficient in the language spoken, and one might imagine a similar weakening if speakers need to maintain activation of two languages simultaneously during a conversation.
The present study further investigates inter-language speech convergence by adopting and applying complexity matching and lexical matching to both intra- and inter-language conversations. In the next two sections, we explain these two measures and how they can be applied within and across languages.
Complexity matching in the nested clustering of speech events
Complexity matching is a recently hypothesized expression of speech convergence and interpersonal coordination inspired by theoretical work on interactions between complex networks (Abney et al., Reference Abney, Paxton, Dale and Kello2014; Marmelat & Delignières, Reference Marmelat and Delignières2012; West, Geneston & Grigolini, Reference West, Geneston and Grigolini2008). West et al. (Reference West, Geneston and Grigolini2008) theorized that when complex networks interact, the exchange of information will be maximized when the distributions of events in the interacting networks share a common dynamic. Exchange of information was defined in terms of mutual impact on each other's event dynamics, and dynamics were defined in terms of the timing of network events.
Inspired by West and colleagues, Abney et al. (Reference Abney, Paxton, Dale and Kello2014) proposed complexity matching as a measure of coordination dynamics in speech. They converted speech waveforms into time series of speech events that capture hierarchical temporal structure (HTS) in the nested clustering of acoustic speech energy. The timescale of small clusters roughly corresponds with phonemes, larger timescales with syllables and words, still larger phrases and sentences, up to the largest timescales that cover conversational turns and other discourse-level dynamics. Clustering across timescales reflects the hierarchical nesting of linguistic units such as phonemes within syllables, syllables within words, words within phrases, and so on (for direct evidence of this claim, see Falk & Kello, Reference Falk and Kello2017).
Abney et al. (Reference Abney, Paxton, Dale and Kello2014) analyzed matching in the nested relations among temporal clusters of acoustic speech events in two different types of conversations. Each pair of partners was cued to have one affiliative conversation about a topic of shared interest, and one argumentative conversation about a topic on which partners took opposing sides. Complexity matching was found in the affiliative conversation but not the argumentative conversation, indicating that the measure of nested clustering reflects more than just turn taking or other general effects of speaking together. Abney, Warlaumont, Oller, Wallot, and Kello (Reference Abney, Warlaumont, Oller, Wallot and Kello2017) followed up with another effect on complexity matching, this time in the vocal interactions between infants and adult caregivers. More matching was found when infants produced speech-related sounds as opposed to non-speech sounds, and adults adjusted their HTS to match that of infants, rather than vice versa.
These two prior studies provide evidence that complexity matching is not just a basic consequence of turn-taking or other general factors of speaking together – complexity matching is more prevalent during positive, communicative interactions, which is evidence that complexity matching reflects a functional convergence in speech interactions. There is also evidence that the nested clustering of events, i.e., the statistic that converges in complexity matching, also reflects functional aspects of speech. Kello, Dalla Bella, Médé, and Balasubramaniam (Reference Kello, Dalla Bella, Médé and Balasubramaniam2017) measured HTS in conversations as well as TED talk monologues spoken in six different languages, including English and Spanish. Language had no effect on HTS, but conversations yielded greater HTS than monologues, and original TED talks yielded greater HTS than speech-synthesized versions of TED talk transcriptions. These effects indicate that HTS varies with discourse processes (dialog versus monologue) and prosody (natural versus synthetic), suggesting that complexity matching may vary depending on the timescales analyzed.
Lexical matching in word frequency distributions
Whereas complexity matching is a measure of convergence based on a physical measure of the speech signal (the amplitude envelope), lexical entrainment is a measure of convergence based on non-physical representations of words or lemmas. Specifically, lexical entrainment is the tendency for speakers to choose and repeat certain words as referents during conversations (Anderson, Garrod & Sanford, Reference Anderson, Garrod and Sanford1983; Brennan & Clark, Reference Brennan and Clark1996; Clark & Brennan, Reference Clark, Brennan, Resnick, LB, John and Teasley1991). For instance, in the study by Brennan and Clark (Reference Brennan and Clark1996) mentioned earlier, speakers were observed to form “conceptual pacts” by converging on certain words to signal certain referents. To measure lexical entrainment, the authors computed the probability of target words being produced when cued from trial to trial.
Lexical entrainment has also been found in more open-ended speech exchanges. For instance, Nenkova, Gravano, and Hirschberg (Reference Nenkova, Gravano and Hirschberg2008) measured lexical entrainment in conversations by quantifying similarities in the proportions of times conversational partners used certain words. Likewise, Levitan, Benus, Gravano, and Hirschberg (Reference Levitan, Benus, Gravano and Hirschberg2015) measured the entrainment of turn-taking behaviors between speakers using Kullback-Leibler Divergence (KLD), which measures the degree to which one probability distribution is contained within another. Levitan et al. (Reference Levitan, Benus, Gravano and Hirschberg2015) compared KLD between conversational partners and surrogate pairs. As expected, KLD was less for partners compared with surrogates, which means that the distributions of word usage for two given partners had more overlap compared with surrogates.
It is not straightforward to apply the above methods to measure lexical entrainment across languages because there may not be a clear mapping between their respective lexicons. In one study, Ni Eochaidh (Reference Ni Eochaidh2010) found lexical entrainment across languages by using a highly constrained bilingual speech task that allowed for unambiguous mappings between lexical items across English and Irish. Otherwise there is a dearth of empirical studies testing lexical entrainment across languages, although a study by Bortfeld and Brennan (Reference Bortfeld and Brennan1997) suggests that less facility with a second language may not interfere with the effect – they found lexical entrainment to occur equally for language interactions in which speakers were more or less proficient at the language spoken.
We use the term “lexical matching” to refer to our measure of convergence in the frequencies of lemma usage, where lemmas abstract over the surface forms of words, and provide a more consistent basis for translation across languages. We use a variant of KLD known as Jensen-Shannon Divergence (JSD), which is simply a symmetric version of KLD. JSD is normalized so that zero means identical probability distributions, and one means completely non-overlapping distributions, i.e., no lemmas shared by conversational partners. JSD requires a correspondence between words spoken, which is mostly a given when conversational partners are speaking the same language. When two different languages are spoken, as in lingua receptiva, we can measure overlap by translating the lemmas of one language into those of the other. Translations can vary depending on the translator, but this issue is addressed by using surrogates as baselines from the same translator. Lexical matching was measured as significant differences from a baseline for both intra- and inter-language conversations. Lexical matching is based on direct correspondences in the intra-language conditions, and translations based on corresponding semantics in the inter-language conditions.
In the present study, we investigated speech convergence within and between English and Spanish in naturalistic conversations using complexity matching and lexical matching as two different measures of convergence. The monolingual English condition served as a baseline and test for replicating prior findings of complexity matching and lexical matching, whereas the monolingual Spanish and bilingual Mixed conditions extended previous investigations of complexity matching and lexical matching to a new language and to inter-language bilingual conversations. We compared the degrees of matching across language conditions, and tested whether inter-language convergence occurs in acoustic and linguistic measures of matching. The range of language backgrounds was relatively narrow in our participant pool, but we also examined whether Spanish and English, as a dominant or non-dominant language, had an effect on convergence when Spanish was spoken. Finally, we tested whether degrees of complexity and lexical matching were positively correlated with each other on the hypothesis that they share a common basis, e.g., via interactive alignment across levels of processing.
Sixty participants (mean age = 19.45; males = 8, females = 52) were recruited from the University of California, Merced, through the SONA participant pool for course credit. Participants were recruited in pairs, and two pairs were omitted from all analyses due to technical difficulties with the audio recordings (remaining subjects: mean age = 19.35; males = 5, females = 51). Our sample size is comparable to that of prior studies involving convergence in dyadic speech (Abney et al., Reference Abney, Paxton, Dale and Kello2014; Falk & Kello, Reference Falk and Kello2017; Marmelat & Delignières, Reference Marmelat and Delignières2012; Pardo et al., Reference Pardo, Gibbons, Suppes and Krauss2012; Pardo, Urmanache, Gash, Wiener, Mason, Wilman, Francis & Decker, Reference Pardo, Urmanache, Gash, Wiener, Mason, Wilman, Francis and Decker2018). All but three dyads reported not knowing one another prior to the experiment, and the remaining three dyads were acquaintances. Participants reported their native language as either Spanish (n = 24), English (n = 5), or both Spanish and English (n = 17). One participant also reported Punjabi as their second language. Participants rated which language they used most comfortably on a daily basis: 7 reported Spanish, 30 English, and 14 both Spanish and English equally (5 non-responses). Finally, the native countries of origin included the United States (n = 39), Mexico (n = 13), El Salvador (n = 2), and both Mexico and the United States (n = 2). One of the native Spanish-speaking experimenters listened to the conversations and confirmed that all participants were able to hold a conversation in Spanish. Additional information about the participants' average proficiency may be found in Table 1.
Dyads were seated at a table in a small room (8.5 by 7 feet) facing one another, approximately 2.5 feet apart, where they engaged in three conversations. Each conversation was recorded using an M-Audio Mobile Pre-amp, two Shure SM10A headset microphones, and the Audacity 2.0.2 audio software.
Upon arrival, participants were instructed to read and sign a consent form, which explained that they could end the experiment at any time without penalty. Participants were then instructed to turn their cell phones off or onto silent mode to avoid interruptions. Participants filled out language background and proficiency information before the study began (see Appendix A). Each pair of participants was informed that they would be engaging in three five-minute conversations with each other. The prompted topic of one conversation was movies, another music, and the third television. Finally, one of the conversations was spoken in English, one in Spanish, and the other in a Mixed condition using both languages, with language in the latter randomly assigned to speakers. Conversation topic, language condition, and order were all randomized such that each possible combination occurred with equal probability (conditions were not counterbalanced).
Dyads were prompted to introduce themselves and chat casually for a few minutes before starting the experiment, while the experimenter tested the audio. This initial conversation was in English, during which the experimenter adjusted the input gain on each microphone relative to each person's speaking voice. During each recorded conversation, pieces of paper were taped on the wall to display the current language(s) and topic of conversation, as a reminder to participants. Participants began each conversation after the researcher said ‘begin’ or ‘start’. The researcher sat in the same room facing away from the participants to monitor the recordings and did not engage with the participants during each five-minute conversation, or trial. Upon completion of the three conversations, participants rated each conversation in terms of ease of communication and comfortability on a scale of one to five, with five being easiest and most comfortable (see Appendix B).
Each conversation was recorded to an uncompressed stereo WAV file, with the output of one microphone sent to the left channel, and the other to the right channel. The input gain level for each participant was adjusted to ensure adequate recording levels while minimizing cross-talk between the microphones. Despite best efforts, a small amount of cross-talk occurred at some points during some of the recordings. Audacity was used to remove cross-talk from the recordings before analysis. A noise profile was selected based on cross-talk examples chosen manually from a visual display of the speech waveform (the experimenter listened to confirm cross-talk), and the selected profile was applied to the whole recording to filter out acoustic energy that resembled cross-talk. For this filtering function, the sensitivity parameter was always set to 25 and the frequency smoothing parameter to 3. In addition to acoustic analyses, each recording was transcribed using TranscribeMe (http://transcribeme.com) for both Spanish and English. Two researchers reviewed all the transcriptions for quality control. Researchers confirmed that dyads in the Mixed condition followed instructions, accidentally switching languages only about two times on average and correcting themselves shortly thereafter. In the Spanish condition, five dyads made one mistake each, and no participants made any mistakes in the English condition. No participants switched languages accidentally in the pure English or Spanish conditions.
Complexity matching and Allan Factor analysis
HTS was measured for each speaker in terms of the amount of nested clustering in peak amplitude speech events across timescales. Specifically, the following operations were performed on the waveform for each speaker (for more detail, see Kello et al., Reference Kello, Dalla Bella, Médé and Balasubramaniam2017): 1) peaks of the amplitude envelope were selected using a variable amplitude threshold set to hold the number of peaks constant across speakers; 2) a log-log Allan Factor (AF) function was computed for each event series, which quantifies the change in clustering across timescales; and 3) the rate of increase in clustering was quantified in the shorter and longer timescales by fitting a regression line to each half of the AF function. Correlations in partner slopes for the longer timescales, where variations in prosody and discourse are expressed, served as our main measure of complexity matching. Using correlations to measure complexity matching obviated the need for surrogate analyses because correlations have an inherent baseline of 0 for no linear relationship.
Acoustic event analysis was based on prior studies of speech event series (Abney et al., Reference Abney, Paxton, Dale and Kello2014, Reference Abney, Warlaumont, Oller, Wallot and Kello2017; Falk & Kello, Reference Falk and Kello2017; Luque, Luque & Lacasa, Reference Luque, Luque and Lacasa2015). Events were peaks in the amplitude envelope of the speech waveform downsampled to 11 kHz, and clusters of peaks roughly correspond to units of speech like syllables, words, and phrases (Falk & Kello, Reference Falk and Kello2017). As in Kello et al. (Reference Kello, Dalla Bella, Médé and Balasubramaniam2017), peak events were identified in the amplitude envelope using two thresholds to ensure peaks were locally maximal and relatively large in amplitude. These thresholds helped filter out noise, normalize recording levels, and produce a sparse event series with the possibility of a large dynamic range in cluster sizes. Kello et al. (Reference Kello, Dalla Bella, Médé and Balasubramaniam2017) showed that AF analyses are not substantially affected by moderate changes to these threshold settings.
Upon completion of the acoustic event analysis, nested clustering of peak amplitude events was quantified using AF analysis, which has been used previously for measuring clustering in neuronal spike events (Lowen & Teich, Reference Lowen and Teich1996). The event series are windowed at multiple timescales, and events are counted within each window. The difference in event counts between adjacent windows is used in a statistical measure of temporally local variance in event counts, where greater variance corresponds with greater clustering. The average degree of clustering is quantified as AF variance A(T) for a given window size T (see Figure 1). A(T) was computed for 11 timescales ranging from approximately 30 ms to 30 sec.
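As a rough illustration of the windowing procedure just described, the sketch below computes A(T) with the standard Allan Factor definition from the point-process literature: the mean squared difference in event counts between adjacent windows of size T, divided by twice the mean count. The function name, the simulated event times, and the doubling scheme for window sizes (0.03 s doubled ten times, which spans roughly 30 ms to 30 sec) are assumptions made for illustration and are not taken from the analysis code used in this study.

```python
import random


def allan_factor(event_times, window, duration):
    """Allan Factor A(T) for one window size T (same time units as event_times)."""
    n_windows = int(duration // window)
    if n_windows < 2:
        return float("nan")
    # Count events falling in each non-overlapping window of size T.
    counts = [0] * n_windows
    for t in event_times:
        i = int(t // window)
        if 0 <= i < n_windows:
            counts[i] += 1
    mean_count = sum(counts) / n_windows
    if mean_count == 0:
        return float("nan")
    # Mean squared difference in counts between adjacent windows.
    sq_diffs = [(counts[i + 1] - counts[i]) ** 2 for i in range(n_windows - 1)]
    return (sum(sq_diffs) / len(sq_diffs)) / (2 * mean_count)


rng = random.Random(0)
duration = 300.0  # seconds; a five-minute recording (illustrative)

# Unclustered events: uniformly random times across the recording.
random_events = [rng.uniform(0, duration) for _ in range(3000)]

# Clustered events: all events fall inside 20 one-second bursts.
burst_starts = [rng.uniform(0, duration - 1.0) for _ in range(20)]
clustered_events = [rng.choice(burst_starts) + rng.uniform(0, 1.0)
                    for _ in range(3000)]

af_random = allan_factor(random_events, 1.0, duration)
af_clustered = allan_factor(clustered_events, 1.0, duration)

# Assumed doubling scheme: 11 window sizes from 0.03 s to about 30.7 s.
windows = [0.03 * 2 ** k for k in range(11)]
af_function = [allan_factor(clustered_events, w, duration) for w in windows]
```

In this toy setup the random series yields a value near 1 and the bursty series a much larger value, which matches the interpretation of A(T) as a measure of clustering at timescale T.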
If events are clustered across timescales, then A(T) > 1 and increases with each larger timescale. If events are random or evenly distributed, then A(T) ≈ 1 across timescales (AF analysis does not distinguish between random and periodic events). As mentioned above, AF functions were divided into the lower and upper halves of timescales (timescales 1–6 and 7–11, respectively). The shorter timescales roughly correspond to variability in the timing of smaller units of speech, such as phonemes, syllables, and words, whereas the longer timescales roughly correspond to variability in the larger units of speech, such as phrases, sentences, and turns.
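The split into shorter and longer timescales, and the correlation of partner slopes used as the measure of complexity matching, can be sketched as follows. Only the division into timescales 1–6 and 7–11 comes from the text. The helper names, the synthetic AF function with a bend at the sixth timescale, and the partner slope values are hypothetical and serve only to show the mechanics of fitting a line to each half in log-log coordinates.

```python
import math


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def af_slopes(windows, af_values, n_short=6):
    """Slopes of the log-log AF function for the shorter and longer timescales."""
    log_t = [math.log10(t) for t in windows]
    log_a = [math.log10(a) for a in af_values]
    short = ols_slope(log_t[:n_short], log_a[:n_short])
    long_ = ols_slope(log_t[n_short:], log_a[n_short:])
    return short, long_


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Synthetic AF function (hypothetical): power law with exponent 0.6 over
# timescales 1-6, flattening to exponent 0.3 over timescales 7-11.
windows = [0.03 * 2 ** k for k in range(11)]
t_break = windows[5]
af_values = [
    (t / 0.03) ** 0.6 if t <= t_break
    else (t_break / 0.03) ** 0.6 * (t / t_break) ** 0.3
    for t in windows
]
short_slope, long_slope = af_slopes(windows, af_values)

# Hypothetical long-timescale slopes for the two partners in five dyads.
slopes_partner_a = [0.30, 0.42, 0.25, 0.38, 0.33]
slopes_partner_b = [0.28, 0.45, 0.27, 0.35, 0.36]
matching_r = pearson_r(slopes_partner_a, slopes_partner_b)
```

A positive correlation across dyads would indicate that partners with steeper long-timescale slopes tend to be paired with partners whose slopes are also steeper, with 0 as the baseline for no linear relationship.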
Lexical matching and JSD analysis
Lexical matching was measured by quantifying overlap in the probability distributions of lemma frequencies used by the two speakers in each dyad. Words were coded in terms of their underlying lemmas using English and Spanish lemma dictionaries that replaced inflected words with their lemma roots (derived from http://www.corpora.heliohost.org). If a word was not in the dictionary, its originally transcribed form was preserved. For participants assigned to speak Spanish in the Mixed condition, one Spanish–English bilingual researcher listened to each conversation and then translated the individual Spanish words into their closest probable English counterparts. A second Spanish–English bilingual researcher reviewed these translations and resolved any discrepancies in the translations with the other researcher.
For each lemma spoken by each person in each conversation, its probability of occurring was computed as its token frequency divided by the total number of lemma tokens spoken by that person in that conversation. To illustrate some example lemmas and their frequencies, Table 2 shows the 20 most frequent lemmas for one randomly selected example dyad in the English language condition, and the same dyad in the Spanish language condition. The number of unique English words in either the English or Mixed condition was significantly higher (M = 136.63, SD = 32.37) than Spanish words in either the Spanish or Mixed condition (M = 116.30, SD = 27.76), t(162.23) = 4.37, p < .001.
The overlap between each participant's lemma distribution was quantified using JSD:
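In its standard form, and assuming that the combined distribution AB is the equal-weight mixture of A and B and that logarithms are taken in base 2 (so that the maximum value is 1), the divergence can be written as:

```latex
\mathrm{JSD}(A \parallel B)
  = \tfrac{1}{2}\, D_{\mathrm{KL}}(A \parallel AB)
  + \tfrac{1}{2}\, D_{\mathrm{KL}}(B \parallel AB),
\qquad
D_{\mathrm{KL}}(P \parallel Q) = \sum_{w} P(w) \log_2 \frac{P(w)}{Q(w)},
\qquad
AB = \tfrac{1}{2}(A + B)
```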
JSD is the mean Kullback-Leibler divergence for each participant's probability distribution, A and B, relative to their combined probability distribution AB. JSD = 1 means totally non-overlapping frequency distributions, and JSD = 0 means identical distributions. JSD values were compared against a baseline to determine if the lemma distributions for a given dyad overlapped more than expected by chance. Using the previously mentioned translated data created by one of our bilingual researchers, in which Spanish words were translated to English words, a surrogate JSD value was determined for each participant in each conversation by pairing his or her lemma frequency distribution with that of all other participants in a different dyad, but in the same language and topic condition. Surrogate JSD values were averaged and compared per dyad.
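The original and surrogate JSD values described above could be computed along the following lines. This is a sketch under stated assumptions: distributions are dictionaries mapping lemmas to probabilities, the combined distribution is taken to be the equal-weight mixture, logarithms are base 2 so that values fall between 0 and 1, and all function names are hypothetical.

```python
from math import log2


def kl_divergence(p, q):
    """Kullback-Leibler divergence of p from q in bits; q must cover the support of p."""
    return sum(pw * log2(pw / q[w]) for w, pw in p.items() if pw > 0)


def jsd(a, b):
    """Jensen-Shannon divergence between two lemma probability distributions."""
    vocabulary = set(a) | set(b)
    # Assumed combined distribution: equal-weight mixture of the two speakers.
    mixture = {w: 0.5 * (a.get(w, 0.0) + b.get(w, 0.0)) for w in vocabulary}
    return 0.5 * kl_divergence(a, mixture) + 0.5 * kl_divergence(b, mixture)


def surrogate_jsd(speaker, others):
    """Mean JSD between one speaker and speakers drawn from other dyads
    (assumed to be pre-filtered to the same language and topic condition)."""
    if not others:
        raise ValueError("no surrogate partners available")
    return sum(jsd(speaker, other) for other in others) / len(others)


# Hypothetical distributions: partner_b is the actual partner, the rest are surrogates.
speaker_a = {"shoe": 0.5, "loafer": 0.5}
partner_b = {"shoe": 0.5, "loafer": 0.5}
surrogates = [{"shoe": 1.0}, {"beach": 1.0}]
difference = jsd(speaker_a, partner_b) - surrogate_jsd(speaker_a, surrogates)
```

A negative `difference` (original minus surrogate) indicates that the actual partners overlapped more in lemma usage than the surrogate pairings did.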
Mean AF functions for each of the three language conditions are shown in Figure 2. The nearly straight lines indicate roughly self-similar nested clustering across timescales. The bend in the functions suggests that clustering was more nested in the shorter timescales, as reflected in a steeper slope to the curve on the left side compared with the right. Visual inspection of the three mean AF functions suggests little effect of conversation type.
Before measuring complexity matching, we first tested whether slopes of AF functions differed by language condition or timescale. We ran a two-way repeated measures ANOVA with language (English, Spanish, Mixed) and timescale (short, long) as fixed factors, the slope of the AF function in log-log coordinates as the dependent variable, and individual participants as the random factor. Slopes did not differ as a function of language (F(2,324) = 0.41, p > .6, MSE = 0.01), but they were steeper in the shorter timescales (F(1,324) = 122.97, p < .001, MSE = 4.07). There was no interaction between language and timescale (F(2,324) = 0.18, p > .8, MSE = 0.01).
To test for effects of order and conversation type, a three-way repeated measures ANOVA was conducted with trial number, conversational topic, and timescale as the independent variables, slope as the dependent variable, and dyad as the random factor. No main effect was found for trial, F(1,312) = 1.52, p > .2, MSE = 0.05, or topic, F(2,312) = 0.05, p > .9, MSE = 0.002. Additionally, no interaction was found among the three independent variables, suggesting that timescale did not modulate the effects of trial or topic, F(2,312) = 0.64, p > .5, MSE = 0.02.
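The dependent variable in these analyses, the slope of an AF function in log-log coordinates over the shorter and longer timescales, can be illustrated with a short sketch. An ordinary least-squares fit is assumed here for simplicity, and the timescale values and function names are hypothetical; only the split into timescales 1–6 and 7–11 is taken from the text.

```python
from math import log10


def loglog_slope(timescales, af_values):
    """Least-squares slope of log10(A(T)) against log10(T)."""
    xs = [log10(t) for t in timescales]
    ys = [log10(a) for a in af_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def short_long_slopes(timescales, af_values, n_short=6):
    """Slopes over the first n_short timescales and over the remaining ones."""
    short = loglog_slope(timescales[:n_short], af_values[:n_short])
    long_ = loglog_slope(timescales[n_short:], af_values[n_short:])
    return short, long_


# Hypothetical AF function over 11 timescales that follows a pure power law, A(T) = T^0.5.
timescales = [2 ** k for k in range(11)]
af_values = [t ** 0.5 for t in timescales]
short_slope, long_slope = short_long_slopes(timescales, af_values)
```

For a pure power law the two slopes coincide; a bend in the AF function, as reported above, would appear as a steeper short-timescale slope.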
Correlations of AF slopes
Complexity matching in AF functions was analyzed using a linear mixed effects regression, which predicted one speaker's AF slope based on their partner's. In addition, we included the fixed effects of conversation order (1–3), timescale (short or long), language condition (English, Spanish, Mixed), and dyad language experience, where the latter was a binary variable indicating whether or not both participants reported Spanish as their primary, native language. Dyad was set as the random effect with a random intercept and random slope.
A reliable effect of overall complexity matching was found, B = 0.87, t(52.8) = 18.35, p < .001. The interaction between timescale and slope was added to the model to test for differences between the short and long timescales. Matching was reliable in the longer timescales (B = 0.54, t(87.4) = 6.77, p < .001) but not the shorter timescales (B = 0.18, t(136.5) = 1.05, p = .3), although the interaction with timescale was only marginally reliable (B = 0.36, t(126.7) = 1.87, p = .06; see Figure 3A). Complexity matching did not vary reliably as a function of language condition (all p > .2; see Figure 3B), conversation order (all p > .3), or conversation topic (all p > .5).
The results so far indicate that complexity matching does not require participants to speak the same language nor the same words or sentences. Therefore, in theory, complexity matching should be applicable to the same person compared with himself or herself speaking in two different conversations. The idea is that complexity matching may measure the character and style of a person's speech, regardless of language or conversation. We consider this possibility because AF variance removes information about specific clusters of peak events, and instead gauges their variance in cluster sizes and durations. Variance in clustering may be comparable in a person's speech across different languages, even though different speech units are produced.
We tested within-speaker matching by running a linear mixed effects model to correlate each participant's AF slopes from the English and Spanish conditions with his or her AF slopes in the Mixed condition. Half of the participants spoke English in the Mixed condition, and half spoke Spanish. Therefore, each participant provided one data point speaking the same language (English and Spanish were merged for this analysis), and one data point speaking different languages. We found that AF slopes correlated within individual speakers across different conversations (B = 0.77, t(37.96) = 15.63, p < .001), but there was no reliable interaction with language condition (B = −0.04, t(177.32) = −0.56, p > .05) or timescale (B = 0.09, t(184.26) = 0.57, p > .05). Taken together, these results indicate that speakers exhibit patterns of nested clustering across all measured timescales that are consistent with themselves across languages and conversations.
Next, we tested whether complexity matching varied as a function of language experience. Because speakers varied in how they used the rating scales, and the ratings were largely subjective, we chose to focus on a simple binary categorization instead: if both members of a dyad listed Spanish as a native language and English as a secondary language, the dyad was categorized as Spanish primary (13 dyads), and otherwise as English primary (15 dyads). For the Spanish and Mixed conditions, the degree of complexity matching was not reliably affected by language experience, B = 0.10, t(17.09) = 0.65, p > .5. Therefore, it appears that language fluency did not vary enough in our sample of participants to affect complexity matching. Nearly all of our participants had similar language backgrounds, in that they were native Californians from families with Mexican heritage who speak a Californian dialect of Spanish and use it on a regular basis.
We tested for matching in lemma usage using a three-way mixed design ANOVA, with language condition (English, Spanish, or Mixed) and JSD type (original or surrogate) as independent repeated measures factors, language experience as an independent between-subjects factor, JSD value as the dependent variable, and dyad as the random effect. A significant main effect of JSD type was found, F(1,150) = 11.76, p < .01, MSE = 0.02, and Figure 4 shows that original JSD values were less divergent than surrogates, indicating an overall effect of lexical matching. This effect cannot be attributed to using words that are common to a given topic of conversation because the JSD surrogates were drawn from the same conversational topic as their corresponding originals. There was also a main effect of language condition, F(2,150) = 81.42, p < .001, MSE = 0.11, reflecting that JSD values were most divergent, or least convergent, in the Mixed condition, followed by Spanish and then English. This main effect may be due to differences in the lemma dictionaries used, and/or unavoidable issues with translation. Importantly, the difference between originals and surrogates did not reliably differ between the Mixed and pure language conditions.
To test whether lexical matching (i.e., the difference between original and surrogate JSD values) varied as a function of language condition, we again used a three-way mixed design ANOVA, with language condition (English, Spanish, or Mixed) and JSD type (original or surrogate) as independent repeated measures factors, language experience as an independent between-subjects factor, JSD value as the dependent variable, and dyad as the random effect. There was no reliable interaction between JSD type and language condition, F(2,150) = 0.38, p > .6, MSE < 0.001.
We also tested for effects of conversation order and topic on lexical matching using two additional three-way ANOVA models. To test for order effects, trial number, language experience, and JSD type were set as independent variables, score as the dependent variable, and dyad as the random factor. The same ANOVA was tested for conversational topic, where topic simply replaced trial number in the independent variables. Neither trial, F(1,156) = 0.05, p > .8, nor topic, F(2,150) = 0.01, p > .9, interacted with JSD type. In summary, like complexity matching, lexical matching appears to be robust to both intra- and inter-language interactions, and unaffected by variations in language experience in our participant pool.
Finally, we tested whether JSD score varied as a function of language experience and JSD type (original participant or surrogate). Again, we focused on a simple binary categorization: if both members of a dyad listed Spanish as a native language and English as a secondary language, the dyad was categorized as Spanish primary (13 dyads), and otherwise as English primary (15 dyads). A linear mixed effects regression tested the interaction between language experience (Spanish native or not Spanish native) and JSD type (original or surrogate). No interaction was found between these two factors, F(1,100) = 0.56, p > .4, nor was there a main effect of language experience, F(1,100) = 1.82, p > .1. We therefore found no evidence of an effect of language experience, consistent with the lack of effect for complexity matching (see above), which may have been due to homogeneity of language fluency in our population sample.
Relation between complexity matching and lexical matching
Analyses thus far indicate that both complexity matching and lexical matching occur in inter-language Spanish–English conversations (i.e., the Mixed condition), with no reliable difference in the magnitude of matching compared with purely English or purely Spanish conversations. For our last analyses, we examined whether complexity and lexical matching have a common basis by correlating their magnitudes. JSD difference values (original minus surrogate) are a direct measure of convergence in word usage for each given conversation, but so far we have measured complexity matching at the aggregate level in terms of AF slope correlations.
To provide a measure of complexity matching per conversation, we computed the absolute differences of AF slopes in the longer timescales produced by each dyad, which are inversely related to JSD difference scores. We then ran a linear mixed effects model with complexity matching as the dependent measure, and the negative of the JSD difference scores (to undo the inverse relationship) as the predictor, with language condition as a factor, and dyad as the random effect with a random intercept and random slope. Lexical matching was found to predict complexity matching, and vice versa, B = 0.74, t(82.0) = 3.71, p < .001.
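The two per-conversation quantities entered into this model can be written out explicitly. The sketch below only restates the arithmetic described in the text; the function names and numeric values are hypothetical, and the mixed effects model itself is not reproduced here.

```python
def slope_difference(slope_a, slope_b):
    """Absolute difference between two partners' long-timescale AF slopes.
    Smaller values mean the partners' slopes were closer together."""
    return abs(slope_a - slope_b)


def negated_jsd_difference(original_jsd, surrogate_jsd):
    """Negative of (original minus surrogate) JSD.
    Larger values mean more lemma overlap than the surrogate baseline."""
    return -(original_jsd - surrogate_jsd)


# Hypothetical values for one conversation.
complexity_measure = slope_difference(0.62, 0.55)
lexical_measure = negated_jsd_difference(0.40, 0.55)
```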
We next tested if the relationship between complexity and lexical matching was moderated by language condition. A linear mixed effects regression was used to predict complexity matching based on the fixed effects of lexical matching and the interaction between language condition and lexical matching. Dyad was set as the random effect with a random intercept and random slope. A marginal interaction was found for the Spanish and English conditions only, B = 0.40, t(78) = 2.06, p = .04. Upon further investigation, we found that the correlation between lexical matching and complexity matching was slightly stronger for Spanish (B = −1.36) compared with English (B = −0.95). The reason for this marginal effect is unclear and its unexpectedness warrants further investigation.
The observed relationship between complexity and lexical matching does not appear to be directly causal because word durations are mostly shorter than the timescales of a second or more at which complexity matching was observed, and because surface forms of words are not directly matched in the Mixed condition. Therefore, overlap in the sounds of words was not the cause of complexity matching, or vice versa. Instead, it appears that convergence has an underlying basis that gives rise to both complexity and lexical matching.
In the present study, we examined two measures of convergence in conversations spoken in English and Spanish. We measured complexity matching and lexical matching when conversational partners spoke the same language, either English or Spanish, and compared these conditions to an inter-language Mixed condition during which one partner spoke English and the other Spanish. The main result was that both types of convergence occurred in all three language conditions, and both complexity matching and lexical matching were no less prevalent in inter-language conversations. Neither type of matching was modulated by the order or topic of conversation, which taken together demonstrates the robust and general nature of convergence in conversation.
Complexity matching was reliable only in the longer timescales, in the range of hundreds of milliseconds to tens of seconds, which is consistent with prior studies indicating that prosodic and discourse processes may be more variable and therefore malleable to convergence. Complexity matching across speakers was not reliable in the shorter timescales, suggesting either that AF slopes are too coarse to detect any small-scale effects of cross-speaker convergence (e.g., in VOTs; see Balukas & Koops, Reference Balukas and Koops2015), or that HTS in the shorter timescales is relatively automatized over the course of language learning. However, complexity matching was reliable within speakers when comparing a speaker with himself or herself across different conversations and languages. This intra-person convergence illustrates how complexity matching captures HTS as reflective of speech “style” rather than the particular words and other linguistic units, because we observed within-person convergence across different conversations, even in different languages. Finally, complexity and lexical matching were reliably correlated with each other, suggesting that they arise from common underlying processes of convergence.
The observed equivalence of speech convergence within and between languages is consistent with the prevalence of lingua receptiva across cultures and generations (ten Thije, Reference ten Thije2013). Human language behaviors and processes appear to generalize over languages commonly and readily, despite myriad differences in linguistic units and structures that require distinct efforts devoted to learning each language. Many bilingual speakers naturally interact using both languages interchangeably, e.g., when there are asymmetries in receptive or productive speech competencies between conversational partners that lead speakers to communicate and coordinate using different languages. It is likely that lingua receptiva and similar language experiences are common in the population of Californian Spanish–English bilinguals from which we sampled.
Theories of convergence generally explain the phenomenon in terms of shared activation of representations, or language processes following aligned trajectories, as in the interactive alignment theory (Garrod & Pickering, Reference Garrod and Pickering2009; Pickering & Garrod, Reference Pickering and Garrod2004; Trofimovich, Reference Trofimovich2016). This hypothesis appears to predict less matching across languages because at least some representations may not be directly aligned. For instance, prosody and discourse processes may be common across languages, and therefore complexity matching may be unaffected by the language spoken. To illustrate, bilingual speakers have voice qualities and styles that are recognizable across the languages spoken, and correlations in AF slopes appear to reflect a speaker's tendency to “bend” their voice towards their partner. By contrast, different languages usually have different wordform lexicons with orthographic and phonological representations that are categorical (symbolic) and cannot bend towards one another in order to align lexical representations. Therefore, our results indicate either that convergence does not rely on direct alignment – meaning the use of the same words – or that language processes and representations are learned to be shared and aligned across languages for proficient bilingual speakers (Guo & Peng, Reference Guo and Peng2006; Kantola & van Gompel, Reference Kantola and van Gompel2011; Kroll, Bobb & Hoshino, Reference Kroll, Bobb and Hoshino2014; Marian & Spivey, Reference Marian and Spivey2003).
While it may be difficult to align lexical and other representations across languages, it is possible for language use to converge in the shapes of probability distributions over word usage and other levels of representation. Specifically, if we assume that speakers generally produce Zipfian frequency distributions (Zipf, Reference Zipf1935), as found in many prior studies across different languages (e.g., Peterson, Tenenbaum, Havlin, Stanley & Perc, Reference Peterson, Tenenbaum, Havlin, Stanley and Perc2012), then we may expect the shape of the power law (its exponent) produced by one speaker to bend towards that of their conversational partner, and vice versa. We could not test this hypothesis in the present study because the conversations were too short, with too few words produced per speaker, to estimate power law exponents.
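To make the quantity in question concrete, the sketch below estimates a Zipfian exponent from one speaker's tokens by fitting a line to the rank-frequency distribution in log-log coordinates. This least-squares approach is assumed here only for brevity; it is a rough estimator, and it is not an analysis performed in the present study. The function name and example tokens are hypothetical.

```python
from collections import Counter
from math import log


def zipf_exponent(tokens):
    """Rough Zipf exponent: negative least-squares slope of log(frequency) on log(rank)."""
    frequencies = sorted(Counter(tokens).values(), reverse=True)
    if len(frequencies) < 2:
        raise ValueError("need at least two distinct word types")
    xs = [log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Hypothetical tokens whose frequencies (12, 6, 4, 3) follow f(r) = 12 / r.
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
exponent = zipf_exponent(tokens)
```

With so few word types the estimate is only meaningful for this idealized example; as noted above, real conversations of this length provide too few tokens per speaker for stable exponents.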
In future studies, it would be interesting to measure lexical matching using Zipf's law in corpora or experiments that contain longer language interactions. It would also be interesting to test both lexical and complexity matching in bilingual interactions between pairs of languages that vary in their phonological, grammatical, and lexical similarity. Other sources of variability to be examined are the language fluencies and backgrounds of speakers. We did not find any effects of language experience on complexity or lexical matching, but our participants were drawn from a fairly homogeneous population of Spanish-speaking Californians with family roots primarily in Mexico. Future studies could sample from a wider range of language fluencies, and greater variation in lingua receptiva experience, to test for effects of language experience on measures of matching.
Finally, we found no effect of conversation order on complexity matching or lexical matching, which suggests that convergence developed relatively quickly within the first five-minute conversation (see Brennan & Clark, Reference Brennan and Clark1996; Potter & Lombardi, Reference Potter and Lombardi1998). However, our measure of complexity matching is ill-suited to measuring time course effects because Allan Factor analysis requires approximately four to five minutes of audio in order to accurately measure hierarchical temporal structure. Likewise, JSD requires an entire word distribution. With longer conversations, one could slide a moving window over the acoustic signal or the time series of words spoken to measure the time course of complexity matching and lexical matching using measures like ours.
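The moving-window idea for lexical matching might look like the following sketch, which computes JSD between two speakers' lemma distributions within successive time windows. Everything here is an assumption for illustration: the input format (lists of `(time_in_seconds, lemma)` pairs), the function names, the base-2 equal-weight form of JSD, and the window and step sizes.

```python
from collections import Counter
from math import log2


def distribution(lemmas):
    """Relative frequency of each lemma in a list of tokens."""
    counts = Counter(lemmas)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def jsd(a, b):
    """Jensen-Shannon divergence (base 2) between two probability dictionaries."""
    mixture = {w: 0.5 * (a.get(w, 0.0) + b.get(w, 0.0)) for w in set(a) | set(b)}

    def kl(p):
        return sum(pw * log2(pw / mixture[w]) for w, pw in p.items())

    return 0.5 * kl(a) + 0.5 * kl(b)


def windowed_jsd(timed_a, timed_b, window, step, duration):
    """JSD per window as (start_time, value); value is None if either speaker is silent."""
    results = []
    n_windows = int((duration - window) // step) + 1
    for i in range(n_windows):
        start = i * step
        end = start + window
        lemmas_a = [w for t, w in timed_a if start <= t < end]
        lemmas_b = [w for t, w in timed_b if start <= t < end]
        if lemmas_a and lemmas_b:
            value = jsd(distribution(lemmas_a), distribution(lemmas_b))
        else:
            value = None
        results.append((start, value))
    return results


# Hypothetical timed lemmas for two speakers in a 15 s excerpt.
speaker_a = [(0.0, "x"), (1.0, "y"), (10.0, "z")]
speaker_b = [(0.5, "x"), (1.5, "y"), (11.0, "q")]
time_course = windowed_jsd(speaker_a, speaker_b, window=5, step=5, duration=15)
```

In practice the windows would need to be long enough to yield a usable word distribution, which is the constraint noted above for JSD.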
In conclusion, the present study explored convergence in monolingual and bilingual conversations using two new measures of matching. We provided evidence for complexity matching and lexical matching as general, robust phenomena in both monolingual and bilingual conversations. Together, these measures of speech convergence appear to reflect basic principles of social interaction and shared processes of monolingual and bilingual language interaction.
For supplementary material accompanying this paper, visit https://doi.org/10.1017/S1366728919000774
We thank Drew Abney and Rick Dale for providing input and assistance during the early stages of this project, and we thank Gilbert Sepulveda and Alexis Luna for their assistance in collecting data and transcribing the recordings. We also thank Kathleen Coburn for her input and statistical assistance. Finally, we would like to thank our reviewers for their detailed feedback that greatly helped to improve the paper. The research was partly supported by NSF award 1529127.
Declarations of interest
None. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. Informed consent was obtained for experimentation with human subjects.
Appendix A: Pre-Experiment Language History Questionnaire
1. SONA ID:
4. Do you have any visual and/or hearing problems? If yes, what are they?
5. What is your native country/ies?
6. What is your native language(s)?
7. What language is spoken in your household?
8. At what age(s) did you start to learn each language, and for how many years?
9. What would you consider to be your primary second language?
10. What language are you most comfortable using on a daily basis?
11. On a scale of one to ten, with ten being the highest level of confidence, please mark your proficiency in the following areas:
a. English reading
1 2 3 4 5 6 7 8 9 10
b. English spelling
1 2 3 4 5 6 7 8 9 10
c. English writing
1 2 3 4 5 6 7 8 9 10
d. English speaking
1 2 3 4 5 6 7 8 9 10
e. English speech comprehension
1 2 3 4 5 6 7 8 9 10
12. On a scale of one to ten, with ten being the highest level of confidence, please mark your proficiency in the following areas:
a. Spanish reading
1 2 3 4 5 6 7 8 9 10
b. Spanish spelling
1 2 3 4 5 6 7 8 9 10
c. Spanish writing
1 2 3 4 5 6 7 8 9 10
d. Spanish speaking
1 2 3 4 5 6 7 8 9 10
e. Spanish speech comprehension
1 2 3 4 5 6 7 8 9 10
13. Estimate, in terms of percentages, how often you use your dominant language and other languages per week (in all weekly activities combined, circle which range best applies):
Dominant language: 0% 0–25% 50–75% 75–100%
Second language: 0% 0–25% 50–75% 75–100%
Appendix B: Post-Experiment Questionnaire
1. Have you ever met your partner before today? If so, are you just acquaintances, or friends?
2. On a scale of 1 to 5, how easy was the conversation in which you both spoke English, with 5 being the easiest?
1 2 3 4 5
3. On a scale of 1 to 5, how easy was the conversation in which you both spoke Spanish, with 5 being the easiest?
1 2 3 4 5
4. On a scale of 1 to 5, how easy was the conversation in which you spoke two different languages, with 5 being the easiest?
1 2 3 4 5