Testing the validity of the Cross-Linguistic Lexical Task as a measure of language proficiency in bilingual children

Elise VAN WONDEREN; Sharon UNSWORTH

doi:10.1017/S030500092000063X

Published online by Cambridge University Press: 17 November 2020

Elise VAN WONDEREN: Centre for Language Studies, Radboud University, Nijmegen, Netherlands
Sharon UNSWORTH*: Centre for Language Studies, Radboud University, Nijmegen, Netherlands
*Address for correspondence: Centre for Language Studies, Radboud University, Postbus 9103, 6500 HD Nijmegen, the Netherlands. E-mail: s.unsworth@let.ru.nl

Abstract

The Cross-linguistic Lexical Task (CLT; Haman, Łuniewska & Pomiechowska, 2015) is a vocabulary task designed to enable cross-linguistic comparisons both across and within (bilingual) children. In this paper we assessed the validity of the CLT as a measure of language proficiency in bilingual children, by determining the extent to which (i) age-matched, monolingual Spanish-speaking and Dutch-speaking children obtained similar scores, (ii) the CLT correlated with other measures of language proficiency in monolingual and bilingual children, and (iii) the factors underlying the CLT's construction, i.e., target words’ estimated Age of Acquisition and Complexity Index, were predictive of children's scores. Our results showed that, while the CLT correlated with other measures and is therefore a valid means of tapping into language proficiency, caution is required when using it to compare children's language proficiency cross-linguistically, as scores for Dutch-speaking and Spanish-speaking monolinguals sometimes differed.

Keywords

language proficiency; vocabulary task; child bilingualism; test validity

Type: Article
Information: Journal of Child Language, Volume 48, Issue 6, November 2021, pp. 1101-1125

DOI: https://doi.org/10.1017/S030500092000063X
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright: Copyright © The Author(s), 2020. Published by Cambridge University Press

Introduction

A complete understanding of a bilingual child's linguistic development requires examining their skills in both languages. In certain circumstances, it is arguably essential to do so. For example, measuring children's proficiency in both languages is crucial for accurately diagnosing language impairment. Both underdiagnosis and overdiagnosis of language impairment are reported in the literature (Boerma, Chiat, Leseman, Timmermeister, Wijnen & Blom, 2015), indicating either that language impairment is overlooked – because language delays in the language of schooling are mistakenly ascribed to the child's bilingual status – or that bilingual children are erroneously diagnosed with language impairment because their proficiency in the other language has not been taken into account.

Measuring proficiency in both languages is also relevant for researchers of child bilingualism, where language proficiency is of interest both as a predictor and as an outcome. Relative language proficiency, i.e., how proficient children are in one language compared to the other, is often taken as a proxy for language dominance, and as such is hypothesized to (partly) explain the magnitude and direction of cross-linguistic influence (e.g., Yip & Matthews, 2000), and the direction and type of code-switching (e.g., Paradis & Nicoladis, 2007). The extent to which bilingual children are proficient in both of their languages has also been used as a predictor for the extent of cognitive advantages compared to monolingual children (e.g., Blom, Küntay, Messer, Verhagen & Leseman, 2014). Conversely, researchers have tried to explain children's (relative) language proficiency across development by looking at child-internal factors, such as age of acquisition, and child-external variables, such as amount of exposure (e.g., Bedore, Peña, Griffin & Hixon, 2016; Bedore, Peña, Summers, Boerger, Resendiz, Greene, Bohman & Gillam, 2012; Chondrogianni & Marinis, 2011; Hoff, Core, Place, Rumiche, Señor & Parra, 2012; Unsworth, Chondrogianni & Skarabela, 2018).

When trying to measure bilingual children's proficiency in both languages, researchers and clinicians are faced with a challenge: for many of the (standardized) proficiency tasks, only a limited number of language versions exist (but cf. Bilingual Verbal Ability Tests, available in 18 languages; Muñoz-Sandoval, Cummins, Alvarado & Ruef, 2005). Furthermore, as different language versions of the same task are often direct translations of each other, they may not be entirely comparable, because they do not always take into account cross-cultural differences and they may differ in complexity (Gathercole, Thomas & Hughes, 2008; Haitana, Pitama & Rucklidge, 2010; Peña, 2007).

Language proficiency is typically measured by tasks targeting (morpho)syntactic skills or vocabulary size. Vocabulary size has been shown to correlate with a range of language skills, including grammatical ability (e.g., Meara, 1996; Miralpeix & Muñoz, 2018) and, as such, vocabulary tests are often used as a proxy for general language proficiency. However, direct translations of vocabulary tests are often not entirely comparable, as so-called translation equivalents may differ in terms of frequency (e.g., curve is a high-frequency word in English, but the Welsh translation cromlin has a very low frequency; Gathercole et al., 2008), specificity (e.g., in Dutch only the verb scheuren “to rip” is used for ripping paper, whereas in Spanish you can also use the more general verb romper “to break”), number of synonyms (Dutch only uses the verb aanbellen for ringing the doorbell, whereas Spanish includes llamar, llamar a la puerta/al timbre, and tocar/picar el timbre), and morphological and phonological complexity (e.g., vrachtwagen in Dutch is a trisyllabic compound noun with two consonant clusters, whereas the English truck is a monosyllabic noun with only one consonant cluster).

To address some of these issues, Haman, Łuniewska and Pomiechowska (2015) developed the Cross-linguistic Lexical Task (CLT). The CLT is an attempt to create cross-linguistically comparable vocabulary tests for a large number of languages: there are 29 language versions readily available, and new language versions can be easily constructed in consultation with the original authors. As part of the LITMUS test battery (Armon-Lotem, Meir & de Jong, 2015), the CLT was originally created as a diagnostic tool to identify language impairment in multilingual populations. Its aim is to be fully comparable across languages, to enable comparison between bilingual and monolingual populations, between the two languages of a bilingual child, and between typically developing and language-impaired populations (Haman et al., 2015). To this end, the construction of the CLT in each language is based on two language-specific properties: an estimation of the target words’ age of acquisition (AoA) and a composite measure of the target words’ complexity.

In the present study we assess the validity of this task for comparing language proficiency across and within bilingual children, by investigating the performance of both monolingual and bilingual children on the Spanish and Dutch CLT. We test whether monolingual Dutch and Spanish children obtain comparable scores; whether CLT scores correlate with other measures of language proficiency; and, following Hansen, Simonsen, Łuniewska and Haman (2017), whether the factors underlying the construction of the CLT – i.e., word complexity and AoA – are predictive of children's scores.

Properties underlying the CLT

The CLT consists of a picture naming and a picture selection task, each further divided into a noun and verb subtask. There are thus four subtasks with 30 items in each. The construction of the CLT in each language is based on: (a) subjective estimations of AoA for each target word, and (b) a target word's complexity index (CI).

For each language, subjective AoA estimates were obtained by asking at least 20 native speakers to estimate the age at which they could understand each word, ranging from 0 (i.e., during the first year of life) to 18 (i.e., at age 18 or later) (Łuniewska, Haman, Armon-Lotem, Etenkowski, Southwood, Anđelković, Blom, Boerma, Chiat, de Abreu, Gagarina, Gavarro, Hakansson, Hickey, de Lopez, Marinis, Popovic, Thordardottir, Blaziene, Sanchenz, Dabasinskiene, Ege, Ehret, Fritsche, Gatt, Janssen, Kambanaros, Kapalkova, Kronqvist, Kunnari, Levorato, Nenonen, Fhlannchadha, O'Toole, Polišenská, Pomiechowska, Ringblom, Rinker, Roch, Savic, Slancova, Tsimpli & Ünal-Logacev, 2016a). This measure was deemed a valid measure of a word's AoA as the estimates correlated with child data on the MacArthur-Bates Communicative Development Inventories (CDI) and with previous AoA estimates.

The CI is a composite measure consisting of both phonological and morphological aspects, and exposure to the depicted object or action (see Table 1). Phonological aspects have the largest impact on the CI score and, in particular, word length in phonemes (normalized within word class).

Table 1. Variables included in the complexity index (CI). Table adapted from Hansen et al. (2017) such that loan word status is no longer excluded (Magdalena Łuniewska, p.c.).

Based on these two measures, words are divided into four difficulty levels (early/late AoA, high/low CI). Target words are selected from a list of 158 nouns and 142 verbs, chosen for consistency in naming agreement across languages (Haman et al., 2015). From each difficulty level, 7 to 8 words are randomly selected as target words for both comprehension and production.
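The selection procedure just described can be illustrated with a minimal Python sketch. It is not the CLT developers' implementation: the median split used to define early/late AoA and low/high CI, the quotas of 8, 8, 7 and 7 words per level, and the randomly generated candidate list are all assumptions made purely for illustration.

```python
import random
import statistics


def assign_levels(words):
    """Group words into four difficulty levels by AoA and CI.

    Assumption: early/late AoA and low/high CI are defined by a median split.
    """
    aoa_cut = statistics.median(w["aoa"] for w in words)
    ci_cut = statistics.median(w["ci"] for w in words)
    levels = {}
    for w in words:
        key = ("early" if w["aoa"] < aoa_cut else "late",
               "low" if w["ci"] < ci_cut else "high")
        levels.setdefault(key, []).append(w["word"])
    return levels


def select_targets(levels, quotas, rng):
    """Randomly draw the requested number of target words from each level."""
    return {key: rng.sample(levels[key], quotas[key]) for key in quotas}


# Hypothetical candidate list: 158 placeholder nouns with made-up AoA and CI values.
rng = random.Random(1)
candidates = [{"word": f"noun_{i}", "aoa": rng.uniform(1, 8), "ci": rng.uniform(-2, 2)}
              for i in range(158)]

levels = assign_levels(candidates)
# Assumed quotas: 7 to 8 words per level, summing to the 30 items of one subtask.
quotas = {("early", "low"): 8, ("early", "high"): 8,
          ("late", "low"): 7, ("late", "high"): 7}
targets = select_targets(levels, quotas, rng)
total_selected = sum(len(ws) for ws in targets.values())
```

Under these assumptions, each of the four levels contributes 7 or 8 words and the selection adds up to the 30 items of a single subtask.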

Previous findings of studies using the CLT

Studies using the CLT have found significant effects of age on children's performance (Haman, Łuniewska, Hansen, Simonsen, Chiat, Bjekić, Blažienė, Chyl, Dabašinskienė, Engel de Abreu, Gagarina, Gavarró, Håkansson, Harel, Holm, Kapalková, Kunnari, Levorato, Lindgren, Mieszkowska, Montes Salarich, Potgieter, Ribu, Ringblom, Rinker, Roch, Slančová, Southwood, Tedeschi, Tuncer, Ünal-Logacev, Vuksanović & Armon-Lotem, 2017), except when bilingual children were tested in the minority language (Bohnacker, Lindgren & Öztekin, 2016; Lindgren & Bohnacker, 2019), echoing previous findings showing that language development in the minority language is more dependent on exposure patterns than age (e.g., Hoff, Welsh, Place & Ribot, 2014). Accordingly, amount of – cumulative or current – exposure and dominance have both been found to predict bilingual children's CLT scores (Bohnacker et al., 2016; Gatt, Attard, Łuniewska & Haman, 2017; Potgieter & Southwood, 2016). Given enough variation in the sample, SES was also a significant predictor (Potgieter & Southwood, 2016; cf. Bohnacker et al., 2016).

In line with the frequently made observation that bilingual children may lag behind monolingual peers in at least one of their two languages on (expressive) vocabulary, some studies found significant differences between monolinguals and bilinguals on the CLT (Altman, Goldstein & Armon-Lotem, 2017; Hansen et al., 2017). At the same time, other studies have failed to find such a difference, perhaps because they included simultaneous (rather than sequential) bilingual children (Lindgren, 2018), or children with substantially more input in the language of testing than in their other language(s) (Potgieter & Southwood, 2016). Crucial for the validity of the CLT as a diagnostic tool, both quantitative and qualitative differences were found between typically-developing children and children with language impairment, both in monolingual and bilingual populations (Kapalková & Slančová, 2017; Khoury Aouad Saliby, Dos Santos, Kouba Hreich & Messarra, 2017).

Two other trends extensively reported in the literature on children's (early) lexical development are the precedence of nouns over verbs, and comprehension over production (e.g., Bornstein, Cote, Maital, Painter, Park, Pascual, Pêcheux, Ruel, Venuti & Vyt, 2004; Clark & Hecht, 1983; Gentner, 1982). Accordingly, many CLT studies have found higher accuracy for nouns over verbs, and for comprehension over production (e.g., Altman et al., 2017; Gatt et al., 2017; Haman et al., 2017).

Important for the validity of the CLT as a measure of vocabulary skills specifically, and overall language proficiency more generally, scores on the CLT have been found to correlate with parental report of children's overall language proficiency and with other standardized vocabulary tasks, although only consistently so for comprehension (Hansen, Łuniewska, Simonsen, Haman, Mieszkowska, Kołak & Wodniecka, 2019; Khoury Aouad Saliby et al., 2017). Note, however, that Khoury Aouad Saliby et al. (2017) used conceptual scoring for production, i.e., correct responses in either language were scored as correct; it is unclear how this may have affected the results. Hansen et al. (2019) found moderate correlations between children's CLT scores and parental estimates of overall language proficiency, but only for the majority language, and not the minority language.

To investigate the validity of the task's construction, Hansen et al. (2017) investigated whether its two components, i.e., target words’ AoA and CI, were robust predictors of children's scores. This was the case for AoA, for both monolinguals and bilinguals, but not for CI. Likewise, Haman et al. (2017) compared monolinguals’ performance on the CLT across 17 languages, and found moderate-to-strong negative correlations with AoA for at least one of the four subtasks in each language. In contrast, the CI revealed low-to-moderate negative correlations with children's scores for only one or two subtasks in seven of the languages and no significant correlations in the other 10 languages.

Despite little or no relation between children's scores and CI, Hansen et al. (2017) argued that the robust effect of AoA may nevertheless be enough to render cross-linguistically comparable CLT versions. Their observation that Polish and Norwegian age-matched monolingual children scored similarly supports this claim. Comparing monolingual children's performance across 17 languages, Haman et al. (2017) found that only isiXhosa-speaking children knew significantly fewer words than the rest, most likely due to their comparatively low SES. When this group was removed from the analysis, a significant effect of language remained. This effect was rather small, and, as suggested by the authors, may have been caused by differences across languages in sample size and age range.

In sum, studies using the CLT have been able to replicate numerous findings from the literature, including the asymmetry for production versus comprehension, and nouns versus verbs, as well as effects of age, SES and relative amount of input on (bilingual) children's vocabulary scores. These studies have also found, however, that the rationale behind the CLT's construction is partially undermined by the lack of a reliable effect for the complexity index. Nonetheless, the CLT's construction procedure may still be successful in creating cross-linguistically comparable vocabulary tasks, as similar scores have been obtained for age-matched monolingual children, at least for Polish and Norwegian. Whether this holds for other language pairs, and whether the CLT correlates with other measures of children's language skills, remains an open question.

The present study

In this paper we report two studies investigating performance on the CLT by monolingual Dutch and Spanish children, and by bilingual Spanish–Dutch children, between the ages of 3 and 9. We chose a wide age range to assess the suitability of the CLT in and beyond the pre-school years. We investigate the validity of the CLT by (a) comparing the scores of the two monolingual groups, (b) comparing children's performance on the CLT with their performance on other measures of language proficiency, and (c) investigating the effects of the two factors underlying the CLT's construction, i.e., the target words’ estimated AoA and CI.

Haman et al. (2017) suggested that phonological complexity may exert less influence on children's lexical development when they are older; for this reason, we also investigate whether there is an interaction between the CI and children's age at testing. In addition, we might expect an interaction between target words’ estimated AoA and children's age at testing: since most of the target words of the CLT have an estimated AoA lower than 6 (Łuniewska et al., 2016a), AoA should matter less for children beyond this age.

Study 1

Method

Participants

In the first study we tested 32 monolingual Dutch-speaking children (14 girls, 18 boys) and 32 monolingual Spanish-speaking children (17 girls, 15 boys), between the ages of 4;0 and 9;7. One additional Spanish-speaking child was tested, but later excluded because of exposure to English. The Dutch-speaking children were sampled from a public primary school in Gelderland, a province in the eastern part of the Netherlands, and the Spanish-speaking children were sampled from one public and one private primary school in Granada, in southern Spain. Ethical approval was obtained following the standard procedures in each location. All parents provided written informed consent prior to testing. The two groups were matched on age and parental education (see Table 2). Parental education was very high in the current sample, with almost all of the parents holding a bachelor's or master's degree.

Table 2. Monolingual children's age (years;months) and parental education (means and SDs).

^a Level of education was measured on a 5-point scale: 1 = primary education, 2 = secondary education, 3 = post-secondary non-tertiary education, 4 = bachelor or master's degree or equivalent, 5 = doctoral degree or equivalent.

Materials

CLT

The construction of the Dutch CLT was finalized in 2017 in consultation with the CLT developers (van Wonderen, Blom, Boerma, Janssen, Unsworth & van Dijk, 2017). For the current study, we adapted the Spanish production part of the CLT based on pilot data of monolingual six-year-olds (Myriam Cantú Sánchez, personal communication). Ten words that did not consistently evoke the target word in these pilot data, or constituted a phonologically identical cognate between Dutch and Spanish, were replaced by other words from the same difficulty level.

Although the CLT was originally intended for use with preschoolers, it has been used for monolingual children up to the age of seven (Haman et al., 2017) and for bilingual children up to a mean age of eight (Ringblom & Dobrova, 2019). Because the present study includes children beyond that age, we decided to add 10 nouns and 10 verbs from the highest difficulty level in an attempt to prevent ceiling effects. Because there was only a limited number of words in the highest difficulty level – and the comprehension subtask requires the additional selection of three distractors per target item – we were only able to do so for the production subtasks. Thus, the total number of target items was 80 for production and 60 for comprehension. To ensure that none of the results presented in this paper were specific to the 40-item version of the production subtask, we additionally ran all analyses on the first 30 items only.

Sentence repetition

As a second measure of children's language proficiency, we included a sentence repetition task (SRT), designed to tap into children's overall language proficiency in both comprehension and production (Marinis & Armon-Lotem, 2015), and shown to be particularly sensitive to children's lexical and morphosyntactic knowledge (Polišenská, 2011).

The SRT used in this study was constructed in such a way that both language versions contained shared and language-specific structures, and that sentence complexity varied enough to accommodate the wide age range investigated. To achieve this, we selected 30 sentences in total from the Repeating Sentences subtasks of the Spanish and Dutch CELF-Preschool-2 (Wiig, Secord & Semel, 2009; Wiig, Secord, Semel & De Jong, 2012), the CELF-4 (Wiig, Secord & Semel, 2006; Kort, Schittekatte & Compaan, 2010) and from the Spanish and Dutch LITMUS-SRTs (Marinis & Armon-Lotem, 2015). Care was taken to ensure that both language versions had an equal number of relatively simple and relatively complex sentences, and that both versions were comparable in terms of average sentence length (mean number of words was 9.0 (SD = 1.8) for Spanish and 9.6 (SD = 2.3) for Dutch, t(58) = 1.12, p = .27). See the Supplementary Materials for all items included in the SRTs.
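As a rough check on the sentence-length comparison, the t statistic can be recomputed from the reported summary statistics alone. The Python sketch below assumes an independent-samples Student's t-test with pooled variance and 30 sentences per language; the latter is inferred from the reported 58 degrees of freedom and is not stated explicitly in the text.

```python
import math


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's t statistic and degrees of freedom from summary statistics."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se, df


# Reported means and SDs (Dutch vs. Spanish sentence length in words).
# Assumption: n = 30 sentences per language, inferred from df = 58.
t_value, df = pooled_t(9.6, 2.3, 30, 9.0, 1.8, 30)
print(f"t({df}) = {t_value:.2f}")
```

With these rounded inputs the statistic comes out at roughly 1.13, which is in line with the reported t(58) = 1.12 given that the means and SDs are themselves rounded to one decimal.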

Procedure

Children were tested individually in a separate, quiet room at school. The CLT was always administered before the SRT, to prevent children from hearing words in the SRT that they would have to produce in the CLT. For the CLT, the order of the production and comprehension tasks was counterbalanced across children. For the picture selection task, the experimenter would ask “Where is a [NOUN]?” or “Who is [VERB]?” and then children were instructed to point to – or say the number of – the corresponding picture. For the picture naming task, the experimenter would ask “What is this?” for nouns, and “What is the [boy/girl/woman/man] doing?” or “What is happening to the [NOUN]?” for verbs. When children were unclear about what was depicted in the picture they were asked to name, the experimenter tried to briefly describe it (e.g., for scales the experimenter would say “this can show you how heavy something is”). If children gave an incorrect answer, the experimenter would ask if they knew another word for the picture before continuing with the next picture.

For the SRT, sentences were pre-recorded by a female native speaker and inserted in the PowerPoint developed for the LITMUS-SRT (Marinis & Armon-Lotem, 2015). Children wore headphones and were told they would go on a treasure hunt with Teddy the bear. In order for Teddy to move a step forward they had to repeat everything Teddy said. Children's responses were recorded with a digital voice recorder. Children were praised for trying to repeat the sentence, regardless of how well they did. When a child could not hear the sentence because of interruptions or loud noises, the sentence was played one more time after completion of the task. After each task, children could select a sticker. In total, testing took approximately 30 minutes.

Scoring

For the CLT, children's responses were noted down on the answer sheet by the experimenter.¹ The original scoring guidelines for production (see Kapalková & Slančová, 2017) were deemed too generous because all responses were scored as correct as long as they contained the target's stem (i.e., including innovations, changes of word class, and one-stem answers for compounds, e.g. “sewing” for “sewing machine”); therefore, the Uppsala guidelines developed by Ute Bohnacker and colleagues were used instead (Bohnacker et al., 2016). This meant that an answer was scored as correct when it contained the target word or a (regional) synonym of the target word. Alternative answers were scored as correct when they corresponded with the picture and were at least as specific as the target word (e.g., peddelen “to paddle” for roeien “to row”). Mispronunciations were allowed if the word was still recognizable as the target (e.g., lipstip for lippenstift “lipstick”). For Spanish, answers containing a wrong determiner were counted as correct (e.g., ver el tele instead of ver la tele “watch television”).

Incorrect answers included words that were too general (e.g., limpiar “to clean” instead of aspirar “to vacuum”), words from the other language, innovations (e.g., zager “saw+suffix” instead of zaag “saw”), and words from a different word class (e.g., a noun in the verb subtask and vice versa). An exception to the word-class rule was made for periphrastic verbs with the verb hacer “to make/do” in Spanish: as this verb is already included in the elicitation question, ¿Qué está haciendo? “What is s/he doing?”, it is pragmatically correct to answer with only the noun. For Spanish, periphrastic verbs containing a preposition were scored as incorrect if the child used the wrong preposition (e.g. hablar con el teléfono “to talk with the phone” instead of hablar por teléfono “to talk on the phone”). For a complete list of accepted alternative answers, see the Supplementary Materials.

For the analysis investigating whether the CI and AoA are predictive of children's scores, we only coded target responses as correct (i.e., disregarding synonyms or alternative correct responses), as the CI and AoA values are based on the target word only.

For the SRT, all children's responses were transcribed by (near-)native research assistants and these transcriptions were checked by the first author using the recordings. Responses were scored as correct if the sentence (a) was grammatical, and (b) contained the target structure. Semantic anomalies or incomplete sentences were scored as incorrect. For Dutch, grammatical gender errors resulting in the wrong determiner (de instead of het or vice versa), demonstrative pronoun (deze/die instead of dit/dat or vice versa) or relative pronoun (die instead of dat or vice versa) were allowed, as gender is acquired relatively late by monolingual Dutch children (e.g., Van der Velde, 2004). For Spanish, dative constructions without a clitic (Di un regalo a mi mamá vs. LE di un regalo a mi mamá “I gave a present to my mom”) were accepted, as the Real Academia Española (RAE, 2005) – a largely prescriptive language body and the official authority on the Spanish language in the Spanish-speaking world – considers them grammatical (though infrequent).

Responses were scored by two different coders in each language, who reached 94.3% agreement for Dutch and 91.4% for Spanish. Cohen's kappa for interrater reliability, which controls for the agreement expected by chance, was 0.88 and 0.79, respectively, which is high. In the case of disagreements, a third coder functioned as arbiter.
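To illustrate how chance-corrected agreement differs from raw percentage agreement, the following minimal sketch computes both for two coders' binary scores. The function name and the toy scores are hypothetical and serve only as an illustration; they are not the study's data or scripts.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(coder_a)
    # Proportion of responses on which both coders gave the same score.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, based on each coder's marginal proportions.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa

# Hypothetical scores (1 = correct, 0 = incorrect) for ten responses.
scores_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
scores_b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
agreement, kappa = cohens_kappa(scores_a, scores_b)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

In this toy example the coders agree on 8 of 10 responses, but kappa is noticeably lower than the raw agreement because a sizeable share of that agreement would be expected by chance alone.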

Analyses

To assess whether CI and AoA were predictive of children's scores (correct vs. incorrect), we fitted Bayesian generalized linear mixed models in R (R Core Team, 2018) using the brms package (version 2.8.6; Bürkner, 2017). While in most cases Bayesian and frequentist analyses lead to very similar conclusions when relatively uninformative priors are used (Albers, Kiers & van Ravenzwaaij, 2018), a Bayesian approach has the advantage that it provides the entire posterior distribution, with which we can say something about the (un)certainty of our estimate and what the 95% most credible values are. In other words, the claim that there is a 95% chance that the true parameter value lies within the credible interval is warranted on a Bayesian approach but not on frequentist approaches (often leading to misinterpretations as researchers interpret frequentist confidence intervals as Bayesian credible intervals; see e.g., Kruschke & Liddell, 2018; Morey, Hoekstra, Rouder, Lee & Wagenmakers, 2016).

The models included CI, AoA, word class (nouns vs. verbs), modality (comprehension vs. production), and children's age as fixed effects, as well as the interactions of CI and AoA with age, and the interactions between all previously mentioned predictors and group (Spanish vs. Dutch monolinguals). Factors were coded with sum-to-zero contrasts, comparing level estimates to the grand mean. Children's age was mean-centered, and CI and AoA were standardized. The model used a maximal random effects structure as recommended by Barr, Levy, Scheepers, and Tily (2013), including random intercepts for participants and items, by-item and by-participant random slopes, and the correlations between by-item random effects. Correlations between by-participant random effects were removed as they were all close to zero and unnecessarily complicated the model. The model was fit using four chains, with 8,000 iterations each (2,000 warm-up). We used brms’ weakly informative default priors, and coefficients were deemed statistically “significant” if the associated 95% credible intervals were non-overlapping with zero, thus indicating a 95% chance that the effect of a predictor was different from zero. For factors, estimated marginal means were computed using the emmeans package (Lenth, 2019) to directly compare the factor levels to each other. Interactions were explored by follow-up models for the relevant subsets of the data.
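The predictor coding described above can be sketched as follows. The models themselves were fitted in R with brms; this Python fragment only illustrates mean-centering, standardization, and two-level sum-to-zero coding. All function names and values are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def center(values):
    """Subtract the mean so that 0 corresponds to the average value."""
    m = mean(values)
    return [v - m for v in values]

def standardize(values):
    """Convert to z-scores (assumption: sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def sum_code(labels, positive_level):
    """Two-level sum-to-zero coding: +1 for positive_level, -1 otherwise."""
    return [1 if label == positive_level else -1 for label in labels]

# Hypothetical values for four observations.
age_months = [50, 62, 74, 98]
aoa = [2.0, 3.0, 4.0, 7.0]
word_class = ["noun", "verb", "noun", "verb"]

age_c = center(age_months)                 # a coefficient is then per month of age
aoa_z = standardize(aoa)                   # a coefficient is then per SD of AoA
word_class_c = sum_code(word_class, "noun")
print(age_c, [round(z, 3) for z in aoa_z], word_class_c)
```

With ±1 coding of a two-level factor, a coefficient expresses the deviation of one level from the grand mean, that is, half the difference between the two levels on the model's scale.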

To check for undue influence from items with no variation, we also ran the models on a reduced dataset excluding items on which all children scored correct (production: n_Dutch = 17/80, n_Spanish = 17/78; comprehension: n_Dutch = 39/60, n_Spanish = 46/59; see Footnote 2) or incorrect (production: n_Spanish = 2/78), according to the strict scoring scheme. The pattern of results remained the same.

Results

CI and AoA in the Dutch and Spanish CLT

Although the CLT is constructed in such a way that each language version contains an equal number of items from each difficulty level, it still may be the case that the versions differ in the mean AoA and CI of the items. For this reason, before examining children's responses, we compared the two CLT versions on these two variables (see Figure 1).

Figure 1. Boxplots of (a) AoA, and (b) CI, for the production and comprehension subtasks of the Dutch and Spanish CLT.

Analyses revealed that there were indeed some small differences between the two language versions in terms of mean CI and AoA (see Table 3). Mean CI differed for both subtasks, with a higher mean CI for Dutch than for Spanish in both the production and the comprehension subtask. In contrast, whilst mean AoA differed for the two languages in the comprehension subtask, with a higher mean AoA for Spanish than for Dutch, mean AoA in the production subtask did not.

Table 3. AoA and CI in the Dutch and Spanish CLT (means and SDs).

^a t-tests with Welch-adjusted degrees of freedom are reported to correct for unequal variance.

Monolingual children's performance on the CLT

Figure 2 presents children's scores on each subtask for each age group and language. In the Dutch group, there were 15 children aged 4 to 5, 10 children aged 6 to 7 and 7 children aged 8 to 9; for Spanish, the numbers per age group were 16, 8 and 8, respectively.

Figure 2. Boxplots of monolingual children's CLT scores divided into three age groups.

Children performed at ceiling on the comprehension subtask, although there was still some variation in the youngest age group, especially on verbs. When looking at production, we see that children's scores increased – and variability decreased – across age groups, but none were at ceiling.

Comparing Spanish and Dutch monolingual children's CLT and SRT scores

Children's scores on the CLT and the SRT are presented in Table 4, along with independent t-tests comparing the mean scores of the Dutch and Spanish monolingual children.

Table 4. Monolingual children's scores on the CLT and SRT (percentages).

^a t-tests with Welch-adjusted degrees of freedom are reported to correct for unequal variance.

When comparing the mean scores of the two monolingual groups, we found that the Spanish group significantly outperformed the Dutch group for the CLT production subtask, scoring on average 6.7% higher (which is equivalent to 2 to 3 of the 40 items). The two groups did not significantly differ in their comprehension scores, nor in their performance on the SRT.

The difference between the two groups for the production subtask of the CLT could not be attributed to the higher mean CI of the items in the Dutch as compared to the Spanish production subtask, as a logistic regression analysis in which CI was controlled for still revealed a significant effect of group, with the odds of answering correctly being 2.40 (95% CI = 2.03, 2.83) times higher for the Spanish than for the Dutch monolinguals, z = 10.3, p < .001 (see Supplementary Materials).
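As a rough consistency check on the reported statistics, the z value can be approximately recovered from the odds ratio and its interval. The sketch below assumes that the interval is a Wald interval that is symmetric on the log-odds scale; this is an assumption, not something stated in the text, and the function name is hypothetical.

```python
import math

def wald_z_from_or(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Approximate the log-odds estimate, its standard error and the Wald z.

    Assumes the 95% interval was built as exp(estimate +/- z_crit * SE).
    """
    log_or = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    return log_or, se, log_or / se

log_or, se, z = wald_z_from_or(2.40, 2.03, 2.83)
print(f"log-odds = {log_or:.3f}, SE = {se:.3f}, z = {z:.1f}")
```

Under that assumption the recovered z is close to the reported value of 10.3; small deviations are to be expected because the inputs are rounded.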

When we divided the data into three different age groups (4–5-year-olds, 6–7-year-olds, and 8–9-year-olds), we found that the differences between the two monolingual groups for the production subtask were almost entirely driven by the youngest age group, with the Spanish-speaking children scoring 12.1% higher than the Dutch-speaking children (equivalent to 4 to 5 of the 40 items), whereas there were no significant differences for the older age groups (all p’s > .05, see Supplementary Materials).

Finally, we checked whether children's scores differed for the first thirty items (i.e., the official CLT), as compared to the final ten items that were added for this study and which were all from the highest difficulty level. As expected, children indeed scored lower on the final ten items (Dutch: M = 70.0%, SD = 19.1, Spanish: M = 65.3%, SD = 19.4), than on the first thirty items (Dutch: M = 79.3%, SD = 11.2, Spanish: M = 89.8%, SD = 5.9). Looking at these averages, one may also note that whilst the Spanish-speaking children as a group outperformed the Dutch-speaking children on the first thirty items, the Dutch-speaking children numerically outperformed the Spanish-speaking children on the last ten items. This means that the overall difference between the two groups reported above (cf. Table 4) was slightly attenuated by the addition of the ten extra items in this study (for the 30-item version, the Spanish-speaking children outperformed the Dutch-speaking children by 10.5% on average, equivalent to 3 to 4 of the 30 items; also compare Tables S5 and S6 in the Supplementary Materials).
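The attenuation can be verified with a short calculation that combines the reported means for the first thirty and the final ten items. This sketch assumes that every child completed all 40 production items, so that the two means can be weighted by 30 and 10; the function name is hypothetical.

```python
def combined_mean(mean_first, n_first, mean_last, n_last):
    """Item-weighted mean percentage correct across two sets of items."""
    return (mean_first * n_first + mean_last * n_last) / (n_first + n_last)

# Reported mean percentages correct (first thirty items, final ten items).
dutch_40 = combined_mean(79.3, 30, 70.0, 10)
spanish_40 = combined_mean(89.8, 30, 65.3, 10)

gap_40 = spanish_40 - dutch_40    # group difference on the 40-item version
gap_30 = 89.8 - 79.3              # group difference on the 30-item version

# Express each gap as a number of items.
items_40 = gap_40 / 100 * 40
items_30 = gap_30 / 100 * 30
print(f"40 items: {gap_40:.1f} points ({items_40:.2f} items)")
print(f"30 items: {gap_30:.1f} points ({items_30:.2f} items)")
```

Under this assumption the combined means reproduce the overall difference of 6.7 percentage points (between 2 and 3 of the 40 items), compared with 10.5 percentage points (between 3 and 4 items) on the 30-item version.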

Table 5. Bilingual children's age (years;months), relative exposure to Dutch, and relative use of Dutch in the home (%), and parental education (means and SDs).

Table 6. Bilingual children's scores (%) on the language proficiency tasks. Number of items was 40 per subtask for the CLT (production only), 30 for both SRTs, and 30 and 29 for the Dutch and Spanish CELF-WS task, respectively.

Correlations between the CLT and the SRT

While controlling for children's age, we found moderate correlations between children's SRT scores and their performance on the CLT's production and comprehension subtasks, for both Dutch (production: r(30) = .62, p < .001; comprehension: r(30) = .51, p = .003) and Spanish (production: r(30) = .74, p < .001; comprehension: r(30) = .52, p = .003).
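One common way to control for a third variable in a correlation is the first-order partial correlation, sketched below. Whether the authors used exactly this computation is an assumption; the function names and the toy scores are hypothetical and are not the study's data.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation between x and y after partialling out z from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical scores for six children (percent correct) and ages in months.
age = [50, 60, 70, 80, 90, 100]
clt = [55, 60, 72, 70, 85, 88]
srt = [40, 52, 60, 58, 75, 80]

print(f"raw r = {pearson(clt, srt):.2f}")
print(f"partial r (age controlled) = {partial_corr(clt, srt, age):.2f}")
```

Because both tasks improve with age, the raw correlation partly reflects shared development; the partial correlation indicates how strongly the two scores are related among children of comparable age.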

Predictive power of CI and AoA

Estimated AoA was a strong predictor of children's scores, with the odds of children scoring correctly being 7.57 times lower for each standard deviation increase in AoA, 95% credible interval (95% CrI) [5.05, 11.66] (as AoA has a negative relationship with children's scores, the inverse odds ratio is reported for AoA throughout this paper for ease of interpretation). CI, on the other hand, was not a significant predictor of children's scores, odds ratio (OR) = 1.27, 95% CrI [0.88, 1.83]. Other significant predictors of children's scores included children's age, with the odds of a correct response becoming 1.05 times higher for a one month increase in age (95% CrI = 1.04, 1.07), subtask (nouns > verbs, OR = 3.20, 95% CrI = 1.38, 5.74), and modality (comprehension > production, OR = 48.2, 95% CrI = 21.2, 89.9). There were no significant interactions with group, and, crucial to our research questions, there were no interactions between age and CI (OR = 1.00, 95% CrI = 0.99, 1.01), nor age and AoA (OR = 1.01, 95% CrI = 0.996, 1.018).
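The following sketch illustrates how the reported odds ratios can be read on other scales: inverting an odds ratio (as done for AoA) and rescaling a per-month odds ratio to a per-year one. The function names are hypothetical, and because the inputs are rounded values from the text, the outputs are approximations.

```python
def invert_or(or_value, low, high):
    """Invert an odds ratio and its interval; the interval bounds swap places."""
    return 1 / or_value, 1 / high, 1 / low

def rescale_or(or_per_unit, units):
    """Odds ratio for a change of `units` in the predictor (multiplicative)."""
    return or_per_unit ** units

# AoA: reported as "7.57 times lower" per SD, i.e. an inverse odds ratio.
aoa_or, aoa_low, aoa_high = invert_or(7.57, 5.05, 11.66)

# Age: reported per month; twelve months compound multiplicatively.
age_or_per_year = rescale_or(1.05, 12)

print(f"AoA OR per SD = {aoa_or:.3f} [{aoa_low:.3f}, {aoa_high:.3f}]")
print(f"age OR per year = {age_or_per_year:.2f}")
```

On the conventional scale, the AoA effect corresponds to an odds ratio of roughly 0.13 per standard deviation, and the age effect of 1.05 per month corresponds to roughly 1.8 per year.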

When removing all items with zero variance (121/277 items), the pattern of results remained the same in the sense that AoA was still a strong predictor of children's scores (OR = 3.05, 95% CrI = 2.15, 4.41) whilst CI was not (OR = 1.28, 95% CrI = 0.90, 1.82), and there were no interactions with children's age. The same held for the analysis with the first thirty items only (AoA: OR = 6.85, 95% CrI = 4.46, 11.01; CI: OR = 1.08, 95% CrI = 0.71, 1.66).
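The figure of 121 out of 277 items follows from the per-subtask counts of zero-variance items given in the Analyses section, as the short tally below shows. The dictionary layout and variable names are illustrative only.

```python
# Items on which all children scored correct: (excluded items, total items).
all_correct = {
    ("production", "Dutch"): (17, 80),
    ("production", "Spanish"): (17, 78),
    ("comprehension", "Dutch"): (39, 60),
    ("comprehension", "Spanish"): (46, 59),
}
# Items on which all children scored incorrect (Spanish production only).
all_incorrect = {("production", "Spanish"): 2}

excluded = sum(n for n, _ in all_correct.values()) + sum(all_incorrect.values())
total = sum(t for _, t in all_correct.values())
remaining = total - excluded
print(f"excluded {excluded} of {total} items; {remaining} remain")
```

Most of the excluded items come from the comprehension subtask (85 of 121), which is consistent with the ceiling effects reported for that subtask.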

To see if any of these results differed for production and comprehension, we ran a separate model including all interactions with modality. The only significant interaction was with CI, showing that CI did not have an effect on the production scores (OR = 1.10, 95% CrI = 0.70, 1.72), whereas an increase in CI seemed to have a positive, rather than a negative, effect on the comprehension scores (OR = 5.28, 95% CrI = 1.78, 22.05). However, the minimal variation on children's comprehension scores makes any estimate very sensitive to item- or participant-related noise, as there were very few items for which accuracy was not close to 100%. Because the results for comprehension may prove unreliable due to these ceiling effects, we also ran a model on the production data only. The pattern of results was the same as for the analysis on the complete dataset reported above.

Discussion

The findings so far suggest that the CLT is indeed indicative of children's language proficiency, as the correlation with the SRT suggests that both tasks are tapping into the same underlying construct. Furthermore, the CLT's construction was partly validated as AoA was a strong predictor of children's scores, in line with previous findings (e.g., Haman et al., 2017). In contrast, CI did not seem to be predictive of children's scores.

We also found that, whereas there was still some variability in comprehension scores in the youngest age group, the older age groups performed at ceiling. This was not the case for production, where even the oldest age group did not perform at ceiling. This was likely in part due to the addition of the ten extra items, as scores were on average lower for these than the first thirty items. At the same time, it is worth noting that the absence of ceiling effects could partly be explained by the effect of specific items: for Dutch, there were three verbs that were correctly produced by 2 or 3 children only (viz. liften “to hitchhike”, bedelen “to beg”, and vijlen “to polish”), and for Spanish there were two verbs that only 5 children named correctly (viz. taladrar “to drill”, and operar “to operate”).

The finding that the monolingual Spanish-speaking children scored significantly higher on the production subtask than the monolingual Dutch-speaking children was unexpected, and raises the question whether this can be explained by differences induced by the sample or by the task. In other words, the two groups may have differed on some other factor than age or parental education, such as (non)verbal intelligence, and this may have caused the difference in scores. Without any data specifically targeting such variables, we cannot rule out sample-related differences for certain. However, the fact that children's scores did not significantly differ for the SRT makes this explanation less likely. At the same time, mean AoA did not significantly differ for the Spanish and Dutch production subtask of the CLT, and the higher mean CI for the Dutch task could also not explain the difference, as controlling for CI did not change the group effect. If, for some reason, the Dutch production subtask is more difficult than the Spanish one, it is as yet unclear why this may be so. It does however raise some doubts about the validity of the CLT, which we consider in the General Discussion.

In sum, the correlation between the CLT and SRT seems to suggest that the CLT is indeed tapping into children's language proficiency, but the significant difference between the Dutch and Spanish monolinguals for the production subtask raises doubts about the validity of the CLT for comparing language proficiency across and – in the case of bilinguals – within children. In the second study we were again interested in the correlation between the CLT and other measures of language proficiency, as well as whether the CI and AoA would be predictive of children's scores. In this second study we tested a group of Spanish–Dutch bilingual children in the same age range as the children tested in the first study. Given the ceiling effects and very limited amount of variance on the comprehension subtask in the monolingual children, we decided to only administer the production subtask of the CLT. A subset of the bilingual children completed an additional, standardized language test targeting morphology and morpho-syntax as a third measure of their language proficiency.