
Estimating reliability for response-time difference measures: Toward a standardized, model-based approach

Published online by Cambridge University Press:  29 May 2023

Bronson Hui*
Affiliation:
Graduate Program of Second Language Acquisition, School of Languages, Literatures, and Cultures, University of Maryland, College Park, MD, USA
Zhiyi Wu
Affiliation:
Graduate Program of Second Language Acquisition, School of Languages, Literatures, and Cultures, University of Maryland, College Park, MD, USA
*Corresponding author: Bronson Hui; Email: bhui@umd.edu

Abstract

A slowdown or a speedup in response times across experimental conditions can be taken as evidence of online deployment of knowledge. However, response-time difference measures are rarely evaluated on their reliability, and there is no standard practice to estimate it. In this article, we used three open data sets to explore an approach to reliability that is based on mixed-effects modeling and to examine model criticism as an outlier treatment strategy. The results suggest that the model-based approach can be superior but show no clear advantage of model criticism. We followed up these results with a simulation study to identify the specific conditions in which the model-based approach has the most benefits. Researchers who cannot include a large number of items and have a moderate level of noise in their data may find this approach particularly useful. We concluded by calling for more awareness and research on the psychometric properties of measures in the field.

Type
Methods Forum
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press

Introduction

Applied second language (L2) researchers have been using tasks based on response time (RT) to tap into learners’ grammar and vocabulary knowledge (e.g., Elgort, 2011; Granena, 2013) as well as individual differences in attributes such as procedural memory capacity (e.g., Buffington et al., 2021). Some of these measures, traditionally used to index online language processing in the psycholinguistics literature, have become commonplace in second language acquisition (SLA). This is in part because they are believed to tap into the knowledge that is available for automatic processing, a fundamental basis for authentic language use (e.g., Elgort, 2011; Suzuki, 2017). Examples of such measures include word-monitoring tasks (Godfroid & Kim, 2021; Granena, 2013; Suzuki et al., 2022), self-paced reading (SPR) tasks (Fang & Wu, 2022b; Godfroid & Kim, 2021; Marsden et al., 2018), and judgment tasks in the priming paradigm (Elgort, 2011; Hui et al., 2022a; Plonsky et al., 2020). In data analysis, researchers using these measures typically focus on differences in RTs across experimental conditions. In a word-monitoring or SPR task, for example, a slowdown in response (or processing) when encountering ungrammaticality is taken as evidence of the learner’s sensitivity to anomalies, which in turn is interpreted as a manifestation of the learner deploying the relevant grammatical knowledge online.

Although these tasks have been useful, applied researchers should exercise caution. An important reason is that these tasks are often used in the psycholinguistics literature to demonstrate a group-level effect, so the extent to which they function well as individual-difference measures in applied contexts remains an open question (Draheim et al., 2019). In this light, researchers must pay more attention to the fundamental psychometric properties of these tasks, such as reliability. However, the reliability of these measures is rarely reported in SLA or in neighboring fields such as cognitive psychology (e.g., Marsden et al., 2018; Parsons et al., 2019; Plonsky & Derrick, 2016). Even when it is reported, the estimation method is not always detailed. As a result, it is often unclear how reliability has been, and should be, computed, leaving applied researchers in a quandary because there is no obvious standard practice to follow. Such a lack of reference in the literature can severely limit researchers’ ability to make strong claims about what these tasks measure and about the relationship between the object of measurement and other variables of interest (McKay & Plonsky, 2021). In this article, we take a step toward an informed, standardized approach to computing reliability for RT differences by evaluating three estimation methods, first with three open data sets and then with simulated data sets. A secondary goal of this paper is to raise researchers’ awareness of the importance of appropriately estimating the error associated with their instruments. Here, we present two studies, the first of which concerns a model-based approach to reliability for RT-difference measures. We then report a series of simulations, based on the results of the first study, to elucidate the specific conditions under which the model-based approach may have the most benefits.

Use of response-time difference measures

The ways in which language processing is investigated in psycholinguistics have inspired L2 researchers to apply online methods to address fundamental questions in SLA. For example, some SLA researchers have been interested in using these methods to measure implicit and automatized explicit knowledge (e.g., Bowles, 2011; Ellis, 2005; Vafaee et al., 2017) because learners can access this knowledge rapidly and effortlessly in real-time processing and fluent language use (Ellis, 2005). In this light, a direct way to tap into this kind of knowledge is to emphasize the learner’s processing in the measurement of linguistic knowledge. This emphasis has led to the rapid adoption of online measures in SLA research (e.g., Elgort et al., 2018; Godfroid, 2020; Marsden et al., 2018), especially now that these tasks are available on various data collection platforms (Patterson & Nicklin, 2023). Given the focus of the present paper on reliability, we limit our discussion to measures that involve a comparison of performance (typically response or processing times) between different experimental conditions.

First, Godfroid and Kim (2021) used both word-monitoring and SPR tasks to measure implicit grammatical knowledge of six target structures (three morphological [e.g., third person -s] and three syntactic [e.g., embedded questions]). In the SPR task, for example, 131 English learners read grammatical or ungrammatical versions of stimulus sentences (e.g., The old woman enjoys reading many different famous novels (p. 615)). Participants pressed a button to proceed to the next word, and the time that elapsed since the previous button press was recorded. Given that the spillover region (e.g., reading) followed the critical grammatical feature in question, participants were expected to show sensitivity to grammatical violations if they had implicit knowledge of the target structure. This sensitivity was operationalized and measured as a slowdown in processing when encountering ungrammaticality, compared with the grammatical baseline condition. In analytical terms, the difference in RTs between the grammatical and the ungrammatical trials represented an indication of implicit knowledge (e.g., Granena, 2013; Maie & DeKeyser, 2020; Suzuki, 2017; Suzuki et al., 2022).

In some cases, researchers expect to observe a speedup rather than a slowdown in SPR reading times. This expectation is grounded in consistent evidence that formulaic language is processed faster than matched control phrases (see Siyanova-Chanturia, 2013, for an overview). This processing advantage is attributed to the entrenchment of these units in memory as a result of frequent exposure (Siyanova-Chanturia & Martinez, 2014). Therefore, faster reading times can be expected when learners read a sentence containing formulaic language than a matched control version. For example, Fang and Wu (2022b) investigated Chinese learners’ knowledge of the either–or construction in English. In an SPR task, participants read stimulus sentences such as Jay painted either | the big house | or the old car | for his family over the summer (p. 9), with or without either. The authors reported a speedup in reading times for the critical region (in this example, or the old car) in the trials that contained either, suggesting that these learners used their knowledge of the formulaic construction to predict upcoming information in reading.

In addition to SPR, judgment tasks in the priming paradigm have also been used to index learners’ lexical knowledge (e.g., Elgort, 2011; Hui et al., 2022a). Priming often involves the presentation of a prime before a target. The prime is meant to influence the processing of subsequent linguistic information contained in the target due to prime–target orthographic, phonological, and/or semantic relationships (see Trofimovich & McDonough, 2011, for an overview). Elgort (2011) introduced this technique to the field as a measure of the representational aspects of vocabulary knowledge acquired through intervention in an applied context. In her study, participants learned pseudowords (e.g., obsolate) using flashcards. With three lexical decision tasks using different kinds of priming—namely, form priming, masked-repetition priming, and semantic priming—the author showed that deliberate learning of words can result in knowledge indexed by these implicit tasks. For example, in masked-repetition priming, participants made a faster lexical decision when the target was preceded by an identical prime (e.g., obsolate-OBSOLATE) than by an unrelated prime (e.g., mythical-OBSOLATE). Again, a faster RT in related trials, compared with unrelated trials, constituted a measure of “the formal-lexical representations of the stimuli” (p. 382). The formation of these representations was interpreted as a learning product in the wake of deliberate word-learning activities.

So far, we have focused our discussion on indexing grammatical and lexical knowledge with RT-difference tasks. In addition, these tasks can be used to measure individual-difference attributes such as procedural memory capacity (Buffington et al., 2021; Maie, 2022). In an alternating serial RT task, for example, participants in the study by Buffington et al. (2021) saw a picture of a dog head filling one position in a row of four circles. The participants were instructed to press the button that corresponded to that position. Across the experimental trials, target locations dictated by a predetermined sequence unknown to the participant alternated with random positions: the odd-numbered trials followed the pattern, whereas the even-numbered trials were random. The rationale of the design was that if participants could learn the predetermined sequence through exposure, they would respond faster on patterned, sequenced trials than on random trials. Thus, the ability to learn a hidden sequence, an attribute linked to procedural memory capacity, is quantified by a comparison of RTs on patterned versus random trials. This RT difference has been used in the literature as an individual-difference measure of procedural memory capacity (e.g., Brill-Schuetz & Morgan-Short, 2014; Faretta-Stutenberg & Morgan-Short, 2018; Maie, 2022).

Taken together, measures that involve differences in RTs have been adopted to index grammar and vocabulary knowledge, as well as cognitive attributes such as procedural memory capacity. Given the ubiquity of RT-difference measures in SLA, there is a need to consistently evaluate the measures that researchers have begun to rely on, especially when they are applied to a novel context of investigation (e.g., adopted from psycholinguistics to SLA). A central aspect of such an evaluation is instrument reliability.

Instrument reliability in SLA

A cornerstone of quantitative research is measurement, which can be defined as the principled assignment of numerical values to objects, attributes, or events (e.g., Stevens, 1946). In SLA, linguistic knowledge in the L2 is perhaps the most important attribute to measure. How we measure knowledge such that, for example, a learner scoring higher on a grammar test indeed possesses more or better grammatical knowledge than their peers is thus fundamental to our work. At the same time, measurement error is inevitable because researchers are often unable to tap into the constructs of interest directly (McKay & Plonsky, 2021). On this account, appropriately estimating the amount of error is critical because it allows researchers to understand the limitations of their instruments and represents one of the very first and most critical steps in evaluating a measure. Reliability, or consistency across measurements, has traditionally been regarded as a necessary condition for validity. As Davis (1992) put it more than three decades ago, “an unreliable measure cannot be valid” (p. 606). In other words, interpretations of scores largely assume that the individual demonstrates at least some consistency in their scores across independent measurements (American Educational Research Association et al., 2014). Indeed, McKay and Plonsky (2021) argue that claims cannot be made about what is being measured and its relationship with other variables without sufficient evidence that the score in question is consistent at acceptable levels. Therefore, reliability deserves more attention in any scientific pursuit. Otherwise, researchers could be drawing conclusions without confirming that their measurements are reliable, which renders their claims more questionable than they need to be.

From a statistical point of view, unreliability should also be addressed to the extent possible. When researchers build a general linear model, predictor variables are assumed to be measured without error. Using an unreliable measure to predict an outcome therefore violates this assumption. Even when the RT-difference measure is the outcome of the model, the variance in the outcome that is not explained by the model can be due to (1) its own unreliability, (2) the lack of sufficient explanatory power of the predictors, or (3) a mix of both. When an unreliable measure is used as either a predictor or the outcome, researchers lose statistical power because the true relationship between the predictors and the outcome can be masked by error. This loss of power can be critical, especially when researchers in some subfields already struggle to obtain sufficiently large sample sizes (Loewen & Hui, 2021; Vitta et al., 2021). However, if researchers are able to maximize instrument reliability, they can then focus on a more substantive search for factors that explain the phenomenon under examination. In summary, instrument reliability plays an important role in quantitative research in SLA and should be considered in the constant evaluation of instruments.
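As a point of reference (a standard result from classical test theory rather than part of the original argument), the attenuation formula makes this masking explicit: the observed correlation between two measures is attenuated by their reliabilities, $r_{xy}^{observed} = r_{xy}^{true}\sqrt{\rho_{xx'}\,\rho_{yy'}}$, so, for example, a true correlation of .50 between two measures that each have a reliability of .40 would be observed, on average, at only .20.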

Reliability for response-time differences

Despite the fundamental role of instrument reliability, L2 researchers often do not report the internal consistency of their tests (McKay & Plonsky, 2021; Plonsky & Derrick, 2016). Researchers using RT-difference measures, such as SPR and judgment tasks, are no exception (Marsden et al., 2018; Plonsky et al., 2020). Although nonreporting does not necessarily imply low reliability (Plonsky & Derrick, 2016), there is a growing literature that should concern researchers, particularly those relying on RT differences.

First, in psycholinguistics, Tan and Yap (2016) reported shockingly low levels of reliability in masked-repetition and semantic priming. In their study, 240 native English speakers performed tasks within the masked-repetition and semantic priming paradigms in separate experimental blocks. The authors evaluated the consistency of the measurements with two reliability measures: split-half reliability and test-retest reliability (see McKay & Plonsky, 2021, for an overview of reliability measures). The former typically involves dividing data collected within one single experimental session into two halves (e.g., odd- and even-numbered items) before computing a correlation between the performance in the two subsets of data. The latter uses data gathered from the same participant but in separate test sessions. These two reliability estimates may seem similar at first sight, but discrepancies between the two have recently been reported, suggesting that the underlying factors influencing the level of reliability estimated by these approaches might differ (Oliveira et al., 2022; West et al., 2018). In Tan and Yap (2016), the correlation coefficients for the repetition priming ranged from .21 (Pearson, test-retest) to .43 (robust, split-half). For the masked semantic priming, these figures were between .05 (Pearson, split-half) and .17 (robust, split-half). The authors cautioned that “the unreliability … makes [the measures] a poor candidate for studying individual differences” (p. 195).

In SLA, Buffington et al. (2021) reported a split-half reliability of .42 for the alternating serial RT task discussed above, based on the performance of 119 participants. Relatedly, authors who rely on serial RT tasks to investigate statistical learning (outside of SLA) have also expressed serious concerns about the poor psychometric properties of the available individual-difference measures for statistical learning (Arnon, 2019; Lammertink et al., 2020; Oliveira et al., 2022; Siegelman et al., 2017; West et al., 2017). For example, Arnon (2019) examined three statistical learning tasks (two auditory and one visual) that were administered to both children and adults twice, with a two-month gap between administrations. In the adult data, the test-retest reliability estimates varied from .45 to .70, depending on the task. When the different tasks, which are meant to index a similar construct (i.e., statistical learning ability), were correlated with one another, the highest figure was .41. For the child data, the picture was even gloomier: the test-retest reliability for individual tasks ranged from .01 to .33, with the highest correlation between tasks at .33.

In cognitive psychology more generally, tasks believed to tap into cognitive abilities, such as executive functioning, have also shown only moderate test-retest reliability, as reported in Hedge et al. (2018). For example, intraclass correlations were .40 and .57 for the RT results of a flanker task, in which participants indicated the direction of an arrow flanked by others pointing in the same (congruent) or different (incongruent) directions. Given this level of reliability, it might not be surprising that tasks that, again, are supposed to tap into the same construct (i.e., the flanker and Stroop tasks) correlated with each other at only .14 (Hedge et al., 2018). Other studies have also warned researchers about this reliability issue with RT-difference tasks in individual differences research in cognition (e.g., Paap & Sawi, 2016; Verhaeghen & De Meersman, 1998).

What merits special attention here is that many of these tasks have consistently been reported to elicit robust effects at the group level, yet the very same tasks are highly unreliable when used to examine individual differences. This reliability paradox (Hedge et al., 2018) urges researchers to pause and evaluate RT-difference measures. Indeed, a reviewer suggested that “the field of individual differences in cognition is experiencing somewhat of a measurement crisis.” Although the extent to which this statement is true remains an open question, a serious yet simple question must be asked: What are we measuring with these tasks after all?

Potential sources of unreliability

Given this general unreliability, researchers should seek to understand its sources. First and foremost, unreliability can be due to substantive factors related to the particular processes that the tasks seek to examine. For example, Tan and Yap (2016) argued that the psycholinguistic mechanisms underlying semantic priming are controlled, as opposed to automatic, in nature. Therefore, the performance of the same individual can vary greatly, causing inconsistency in the measurement. West et al. (2018) suggested that the complex nature of procedural processes, relative to processes related to declarative memory, is the reason why tasks used for assessing procedural learning (i.e., serial RT tasks) were significantly less reliable than the tasks for declarative learning (i.e., free recall tasks) in their data. In other cases, researchers believe that some of these tasks are not measuring what they were designed to measure after all; instead, they might be tapping into theoretically less interesting constructs, such as processing speed and strategies (e.g., Hedge et al., 2022; Miller & Ulrich, 2013; Rouder et al., 2022).

The second source of unreliability can be statistical. Specifically, low reliability can be due to a lack of sufficient variance between participants (e.g., Clark et al., 2022; Hedge et al., 2018). This means that participants are not different enough from one another to show reliable individual differences. This may be due to homogeneous sampling or to the nature of the attribute being measured, on which individuals simply do not differ much (Hedge et al., 2018). It can also be caused by the design of the task: the task may be too easy for the sample, producing a ceiling effect, and/or there may be little variability in item difficulty (Clark et al., 2022; Hedge et al., 2018). Also related to task design, data collected from participants tested at home may contain additional random noise because the set-up of the participant’s own technology plays a critical role (e.g., Patterson & Nicklin, 2023).

Finally, the source of unreliability can also be computational. That is, the way in which the RT-difference data are preprocessed and analyzed could have contributed to the low level of consistency. For example, Buffington et al. (2021) speculated that computing difference scores between trial types may be “the source of low reliability” (p. 647). In the analysis code that these authors shared, they commendably documented their thought process in pinpointing the cause of inconsistency. They wrote that outliers “needed to be addressed,” and the outlier treatment strategy they had employed “did seem to help.” Among all potential sources, the computation of consistency levels is possibly the one that applied researchers, who use these measures to address their substantive research questions (as opposed to methodologists who examine these methods), can act on, because they can compute reliability based on an informed approach. If that is the case, the natural question is then how researchers should estimate reliability for RT-difference data.

Estimating reliability for response-time differences

As mentioned, there is a lack of consensus on how to estimate internal consistency for RT-difference data. Here, we discuss three methods: RT differences computed from (1) raw RTs, (2) by-participant z-transformed RTs, and (3) by-participant estimates derived from mixed-effects models.

First, an intuitive approach is to compute the RT difference for each item across two trial types (e.g., related vs. unrelated or grammatical vs. ungrammatical). For example, when a slowdown in processing is expected for grammatical violations, a 500-ms response on a grammatical trial and a 550-ms response on an ungrammatical trial translate into a 50-ms slowdown. This difference can be used to index one’s grammatical sensitivity. What Buffington et al. (2021) argued is that this difference score, aggregated across items, is not reliable. The unreliable nature of difference (or change) scores has long been a concern in social sciences (e.g., Cronbach & Furby, 1970; Gulliksen, 1950). Similar arguments have been made more recently in the context of RT research (Draheim et al., 2019). One reason for such an inconsistency is that subtraction can reduce the between-participant variance relative to error variance (e.g., Hedge et al., 2018). In other words, by subtracting the RT in one trial from that in another trial, the researcher removes the useful, common information carried by the two RTs that makes the individual participant unique in the sample (i.e., the between-participants variance). What is left then is random variation within the individual (i.e., within-participant variance), contributing to the overall unreliability. Another criticism of a raw difference is that it does not adequately account for baseline differences between individuals (e.g., Tan & Yap, 2016). For example, a 50-ms slowdown indexes somewhat different levels of change for a learner whose baseline RT was 500 ms versus their peer whose baseline was, say, 300 ms. In this respect, taking one’s baseline into account could yield better reliability.
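To make the raw-difference approach concrete, the following minimal R sketch computes an item-level RT difference and a per-participant mean difference. It assumes a hypothetical trial-level data frame dat with columns participant, item, condition (coded "gram" or "ungram"), and rt, and one observation per participant–item–condition cell; none of these names come from the original data sets.

```r
library(dplyr)
library(tidyr)

# Hypothetical trial-level data: one RT per participant x item x condition cell
diffs <- dat %>%
  pivot_wider(id_cols = c(participant, item),
              names_from = condition, values_from = rt) %>%
  mutate(diff = ungram - gram)        # positive value = slowdown on ungrammatical trials

# Aggregate: one mean RT difference per participant (and, analogously, per item)
by_participant <- diffs %>%
  group_by(participant) %>%
  summarise(mean_diff = mean(diff, na.rm = TRUE))
```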

One way to do so is the second approach that we discuss here. Following this approach, researchers first compute z-score-transformed RTs, based on each participant’s own mean and standard deviation (Hutchison et al., 2008; Tan & Yap, 2016). This step puts participants on an equal footing because all participants have a mean RT of zero and the change scores (RT differences) are expressed in each participant’s own standard-deviation units. In this way, the baseline RT for each participant is accounted for in the computation, which addresses the limitations of using raw RTs.
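A minimal sketch of the by-participant z-transformation, again assuming the hypothetical dat data frame introduced above; the RT differences are then computed exactly as for raw RTs, only in each participant’s own standard-deviation units.

```r
library(dplyr)

z_dat <- dat %>%
  group_by(participant) %>%
  mutate(z_rt = (rt - mean(rt, na.rm = TRUE)) / sd(rt, na.rm = TRUE)) %>%  # participant-wise z score
  ungroup()
```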

However, there are two additional points to note here. First, following either of these approaches, researchers may need to decide on their handling of missing data. RTs can be coded as missing when participants respond incorrectly or when the RT falls outside the data-trimming threshold. In addition, (applied) psycholinguists often adopt a counterbalanced design in which each participant sees only one of the two versions (e.g., grammatical) of an item, in order to avoid exposing participants to very similar trials and thus creating unwanted confounds. All these situations are not uncommon in the analysis of psycholinguistic data. One way to handle missing data when estimating reliability is then to discard an item when one of its two RTs is missing. This can result in discarding more data than necessary and has implications for statistical power and the intended inferences researchers wish to make. The second way to get around the problem is to perform an aggregation before subtraction. That is, each participant receives a mean RT for each of the two trial types before the computation of an RT difference. If there were no missing data, the two orders of operation (i.e., aggregation before subtraction and subtraction before aggregation) would yield mathematically identical results. However, with missing data, aggregation before subtraction means that the average RTs for the two trial types likely come from different sets of items. Although averaging across (a large number of) items should generally lead to more accurate results, one might still question the extent to which the items that enter the computation for the two trial types are similar enough to warrant a direct comparison.

Second, aggregating an effect across items, be it before or after subtraction, should remind applied psycholinguists of how researchers used to carry out separate by-participant and by-item analyses for RT data, which are no longer recommended (e.g., Baayen et al., 2008). The reason is that by-participant aggregations essentially ignore the variability in the effects associated with the item and vice versa. The more contemporary approach is to simultaneously model participant and item variability by implementing a mixed-effects model (e.g., Baayen et al., 2008). Generally, a mixed-effects model results in more accurate estimates and represents a more parsimonious analysis of the data. It is not an exaggeration to suggest that applied psycholinguists are already intimately familiar with the technique.

In the present context of estimating reliability for RT differences, the use of such a model is rare in SLA. This is the third approach we discuss here: the model-based approach. Not adopting it can be seen as a missed opportunity because, again, many researchers are already familiar with these types of models. Moreover, it has the advantage of simultaneously modeling random effects associated with both item and participant, accounting for dependency in the data that results from individual-difference factors specific to the participant (e.g., processing speed) and characteristics specific to the item (e.g., frequency). Perhaps less discussed in SLA is the ability of a mixed-effects model to handle missing data through (restricted) maximum likelihood (e.g., Hox et al., 2018), addressing the missing-data challenge discussed above. Therefore, a model-based approach is a promising candidate for arriving at more accurate reliability estimates.

One way to understand a mixed-effects model is to imagine that a regression line is fitted for each participant and for each item (level-2 units). That means that every level-2 unit has its own intercept and slope terms (when they are allowed to vary). When there are, for example, 40 participants in the data set, there can be 40 intercept and 40 slope values, as well as a correlation between them. Simultaneously, the fixed effects, which researchers often interpret, are computed by the algorithm, taking into account these random effects. The critical information here in relation to reliability assessment is this: The by-participant random slope for each individual represents a model-based summary of the main effect (slowdown or speedup) specific to the learner. Therefore, the by-participant random slopes can be seen as individualized difference scores, after accounting for all relevant random effects. The use of by-participant random-slope values as a basis for estimating reliability, on this account, should produce better results.
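As an illustration of how these individualized difference scores can be obtained, the sketch below fits a mixed-effects model with lme4 and extracts the by-participant random slopes. The data frame dat, the deviation-coded predictor cond_c, and the grouping-variable names are hypothetical stand-ins, not the variable names used in the original data sets.

```r
library(lme4)

# Hypothetical trial-level data: rt, condition, participant, item
dat$cond_c <- ifelse(dat$condition == "ungram", 0.5, -0.5)   # deviation coding of trial type

m <- lmer(rt ~ cond_c + (1 + cond_c | participant) + (1 + cond_c | item), data = dat)

# By-participant random slopes: each participant's deviation from the average effect.
# These serve as individualized difference scores (adding the fixed effect shifts all
# values by a constant and does not change any correlation computed from them).
slopes <- ranef(m)$participant[, "cond_c"]
```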

We should note, at this point, that we are not the first to suggest that mixed-effects models can be used to estimate the reliability of RT-difference measures. Previous authors (e.g., Rouder & Haaf, 2019) have already suggested the same. Rouder and Haaf (2019) were interested in the cause of the low correlation between two attentional control tasks, the Stroop and flanker tasks, reported by Hedge et al. (2018). They reanalyzed the data of Hedge et al. (2018) and reported better test-retest correlations based on model estimates. Overall, the authors observed an increase of around .20, compared with the non-model-based sample correlations. Given the promising evidence of Rouder and Haaf (2019), it is high time to test the extent to which a model-based approach to estimating reliability is appropriate for RT-based L2 data.

In addition, the model-based approach offers an unexplored opportunity to treat outliers in reliability assessments. As discussed, Buffington et al. (2021) pointed out in their analysis code that outliers can be detrimental to overall instrument reliability. Although those authors handled outlier RTs using more conventional strategies based on the means and standard deviations of individual participants, the extent to which a model-based approach might offer better results remains an open question. In RT-based research, Baayen and Milin (2010) have shown the benefits of engaging in model criticism as a way to trim RT data, although model criticism is not always used to handle outliers in analyzing SPR data (e.g., Marsden et al., 2018). This approach amounts to fitting an initial mixed-effects model to the raw data to identify and remove observations with a large standardized residual (e.g., an absolute value larger than 2.5; Baayen & Milin, 2010). Researchers then refit the model to the trimmed data set with the same model specifications. According to these authors, the refitted model almost always has a better fit, with relatively few observations removed. On this basis, the reliability resulting from this procedure should represent a more accurate estimate because the better model fit should produce more accurate by-participant random slopes. Therefore, the application of this technique could further improve the estimated reliability of the RT differences under investigation.

The present studies

Considering this review, we formulated two research questions for our first study to assess the model-based approach to estimating reliability:

RQ1: To what extent does a model-based approach yield more reliable RT differences?

RQ2: To what extent does model criticism as an outlier treatment strategy yield more reliable RT differences?

Based on the results of the first study, we further addressed RQ3 in our second study, a simulation study, to examine the boundaries of the model-based approach:

RQ3: Under what conditions, in terms of the number of items and the level of error, is a model-based approach more beneficial than non-model-based approaches in estimating the reliability of RT differences?

Taken together, our current attempt represents an important and ethical step toward evaluating RT-difference measures that SLA researchers increasingly rely on, moving beyond simply accepting our instruments at face value without scrutinizing (much) their reliability and validity (Cohen & Macaro, 2013).

Study 1: Analysis of three open data sets

Methodology

To address our first two research questions, we took advantage of open data shared by L2 researchers. Three data sets, comprising data from an SPR task, a lexical decision task in masked-repetition priming, and an alternating serial reaction time task, were used to estimate the reliability of RT differences. For RQ1, we tested three computational approaches: RT differences based on (1) raw RTs, (2) by-participant z-transformed RTs, and (3) model-based estimates. Addressing RQ2, we compared trimming strategies based on means and standard deviations (following the original authors of the data sets) with the implementation of model criticism. In the spirit of open science, all R code (R Core Team, 2022) used for data analysis is available on the Open Science Framework (https://osf.io/cd5r8/).

Data sets

Here, we provide minimal background information to understand the contexts from which the original data were collected. Interested readers are referred to the substantive publications associated with the data sets.

The first data set was shared by Fang and Wu (2022a), available on the Open Science Framework (https://osf.io/abhjv/). The associated substantive publication was Fang and Wu (2022b). The authors administered SPR and acceptability judgment tasks to investigate learners’ (N = 122; 135 initially, 13 removed) knowledge of the either–or construction in English. We used only the SPR data involving learners (bind_SPR_English_L2_version1-2.csv). Participants read, in a self-paced manner, two versions of 20 sentences: one with either (e.g., Jay painted either | the big house | or the old car | for his family over the summer) and one without (e.g., Jay painted | the big house | or the old car | for his family over the summer). The authors reported a significant speedup in reading times in the critical region (i.e., or the old car) on the trials that included either compared with those without, suggesting that learners can use knowledge of the either–or construction to make predictions in reading. Given that the researchers’ claims were largely based on a speedup in the critical region, we focused on this region in our analysis as well.

The second data set we used was made public by Hui et al. (2022b), available on the Open Science Framework (https://osf.io/uyfh5/) under the Creative Commons Attribution 4.0 International Public License. The associated substantive publication was Hui et al. (2022a). In their study, the authors administered a set of four vocabulary tests to 129 (144 initially, 15 removed) advanced English learners at an American university. We used only the subset of data for masked-repetition priming (i.e., data_exp_35094-v18_task-uoc2.csv). As discussed, this task was used as a vocabulary measure to index the extent to which the lexical entries of a sample of target words (K = 40) had been established in the mental lexicon. In the 80 critical trials (two for each of the 40 items), participants were exposed to a prime presented very briefly (55 ms) and forward masked by a string of hashtags (####) shown for 500 ms. Immediately afterward, the participant made a lexical decision on the target, presented in upper case, to indicate whether or not it formed an English word. The authors reported a group-level priming effect in which participants responded faster to the target in the related, identical trials (e.g., patience-PATIENCE) than in the unrelated trials (e.g., occasion-PATIENCE). The authors also reported that such priming was not observed in their nonword data, providing further evidence for the validity of the measure.

The third data set was shared by Buffington and Morgan-Short (2022), available on the Open Science Framework (https://osf.io/ux4qs/). The associated substantive publication was Buffington et al. (2021). The authors administered a total of six memory assessments (three for procedural memory and three for declarative memory). We used only the data set for the alternating serial reaction time task (i.e., ASRT_MasterData.csv). This task tested the ability of the participants (N = 99; 119 initially, 20 removed) to acquire an implicit, patterned sequence, an ability associated with the use of procedural memory. As reviewed, participants pressed a button corresponding to the location of a dog’s head appearing in one of four circles. The sequence followed a second-order pattern in which patterned trials alternated with random trials. There were a total of 20 experimental blocks, each of which had 85 trials (five random trials to start, followed by 80 alternating patterned and random trials).

Analysis

For each data set, we computed a total of 22 split-half correlations, following the three computational approaches with and without a trimming procedure (RQ1). In the case of the model-based approach, we tested an additional trimming method—namely, model criticism (RQ2). For the first two computational approaches (raw and z scores), we implemented both by-participant and by-item analyses. For all reliability estimates, we performed two correlation tests (Pearson and robust) for each analysis method (see Figure 1).

Figure 1. The three computational approaches and the corresponding calculations.

To analyze the data, we first repeated all accuracy-based screening procedures following the original authors. We also removed practically impossible RTs, as defined by the authors (negative RTs for Fang & Wu, 2022a; 300 ms for Hui et al., 2022b; 100 ms for Buffington & Morgan-Short, 2022). This treatment was to remove completely unusable data and differed from the trimming procedure that seeks to rid the data sets of outliers. Although Fang and Wu (2022b) logarithmically transformed and residualized the RTs on the length of the region before their further analysis, we used the “raw” RTs in our analysis for consistency across the three data sets. As a sensitivity analysis, we confirmed that using the logged, residualized RTs did not change our conclusion. The resulting data sets from these preliminary processing steps represented the untrimmed data sets for further computation.

To create the trimmed data sets, we also deleted RTs that were above the authors’ upper threshold (2.5 standard deviations from the learner’s mean for Fang & Wu, 2022a; 2,500 ms for Hui et al., 2022b; and 3.0 standard deviations from the participant’s mean for Buffington et al., 2021).

As a next step, we split each data set into two halves (i.e., odd-numbered and even-numbered items). For Buffington and Morgan-Short (2022), we needed to assign item numbers to the alternating trials so that every pair of back-to-back trials (one patterned and one random) was treated as a duplet item. In this way, we addressed issues associated with the learning that takes place during the task, which can lead to serial dependence among trials (see more discussion in Buffington et al., 2021).

For our first computational approach, we calculated the raw difference in RTs for each item. We then aggregated these differences across participants and across items, such that each participant and each item had a mean RT difference. To estimate split-half reliability, we computed both Pearson and robust correlations between the two halved data sets, using the cor.test( ) function in the stats package (R Core Team, 2022) and the pbcor( ) function in the WRS2 package (Mair & Wilcox, 2020).
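For concreteness, the split-half correlations can be computed as in the sketch below, where half1 and half2 are hypothetical vectors of per-participant mean RT differences from the odd- and even-numbered items, ordered identically by participant.

```r
library(WRS2)

cor.test(half1, half2)        # Pearson split-half correlation (stats package)
pbcor(half1, half2)           # robust percentage-bend correlation (WRS2 package)
```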

For the second approach, we transformed the RTs into z scores based on each participant’s own mean and standard deviation. After that, we computed the differences and aggregated them across items and across participants. The correlation tests were performed in the same way as in the first approach.

In terms of the model-based approach, we constructed two separate models, one for each of the two halved data sets. For all three data sets, the outcome was specified as the inverse of RT (-1/RT), following Hui et al. (2022a). We used a maximal random-effects structure, including all relevant random intercepts and slopes, as well as their correlations (Barr et al., 2013). To resolve issues with singular fit and nonconvergence, we used the nloptwrap optimizer from the optimx package (Nash & Varadhan, 2011) and the partial Bayesian method implemented in the blme package (Chung et al., 2013) to force the relevant random-effects matrices away from singularity. From the models, we extracted the by-participant random slope for each participant before correlating the slopes across the two halves of each data set.
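A minimal sketch of this kind of model specification, using the same hypothetical variable names as above; dat_half stands for one half (odd- or even-numbered items) of such a data set. blmer() from the blme package accepts the lme4 interface, and its default covariance priors pull the random-effects covariance matrices away from the boundary. The exact formulas, optimizer settings, and priors used in the actual analysis are documented in the shared R code on the OSF page.

```r
library(lme4)
library(blme)

dat_half$rt_inv <- -1 / dat_half$rt     # inverse-transformed outcome, as specified in the text

m_half <- blmer(rt_inv ~ cond_c + (1 + cond_c | participant) + (1 + cond_c | item),
                data    = dat_half,
                control = lmerControl(optimizer = "nloptwrap"))

slopes_half <- ranef(m_half)$participant[, "cond_c"]   # one random slope per participant
```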

Finally, we tested whether and how model criticism as an outlier treatment strategy might be useful (RQ2). To engage in model criticism, we first used the residuals of the models fitted to the untrimmed data to identify outliers and then removed observations with an absolute standardized residual greater than 2.5 (Baayen & Milin, 2010). We refitted the models to these trimmed data sets using the same specifications and extracted the by-participant random slopes for the correlation tests.
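The sketch below illustrates this model-criticism step for the hypothetical half model fitted above, assuming dat_half contains no missing values in the modeled variables (so that the residual vector lines up with the rows of the data).

```r
# Identify observations with |standardized residual| >= 2.5 from the untrimmed-data model
std_res <- scale(resid(m_half))[, 1]
keep    <- abs(std_res) < 2.5

# Refit with identical specifications on the trimmed data and re-extract the slopes
m_trim <- blmer(rt_inv ~ cond_c + (1 + cond_c | participant) + (1 + cond_c | item),
                data    = dat_half[keep, ],
                control = lmerControl(optimizer = "nloptwrap"))
slopes_trim <- ranef(m_trim)$participant[, "cond_c"]
```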

As an additional note, whenever a correlation test returned an unexpected negative value, we followed Buffington et al. (2021) in applying a correction according to Krus and Helmstadter’s (1993) Equation 15—namely, $corrected\ r = \frac{-r_{ab}}{.5\,(1 - r_{ab})}$. We have marked the cases where the correction was applied in our results.
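As a worked illustration of this correction (the function name is ours, not from the shared analysis code):

```r
# Krus and Helmstadter's (1993) Equation 15, applied only to negative split-half correlations
correct_negative_r <- function(r_ab) -r_ab / (0.5 * (1 - r_ab))

correct_negative_r(-0.10)   # returns approximately 0.18
```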

Results

We present the correlation coefficients in Tables 1 to 3. Before we address our research questions, one striking observation is that the correlation coefficient can vary hugely depending on the data analysis undertaken. With the same data set (e.g., Hui et al., 2022b), the coefficient ranged from .02, indicating essentially no association between what was measured in the two halved data sets, to .88, a satisfactory level of reliability. This variability confirmed the dire need to develop a more informed and standardized approach to estimating reliability.

Table 1. Split-half correlations for the Fang and Wu data set

Note: # indicates that a correction was applied to a negative coefficient according to Krus and Helmstadter (1993). We do not report 95% CIs and p values because it is not clear whether a correction for these statistics is needed and, if so, how to compute them.

Table 2. Split-half correlations for the Hui et al. data set

Note: # indicates that a correction was applied to a negative coefficient according to Krus and Helmstadter (1993). We do not report 95% CIs and p values because it is not clear whether a correction for these statistics is needed and, if so, how to compute them.

Table 3. Split-half correlations for the Buffington and Morgan-Short data set

Note: # indicates that a correction was applied to a negative coefficient according to Krus and Helmstadter (1993). We do not report 95% CIs and p values because it is not clear whether a correction for these statistics is needed and, if so, how to compute them.

In terms of RQ1, concerning the usefulness of the model-based approach, only the results for the Hui et al. (2022b) data showed a superiority of the model-based approach over the other two approaches. The reliability was in the .30 range at best with non-model-based approaches. However, when the by-participant random slope was used as a basis for reliability estimation, the figures were mostly in the .80 range, suggesting satisfactory levels of internal consistency. In contrast, neither the Fang and Wu (2022a) data nor the Buffington and Morgan-Short (2022) data presented a clear pattern of which analysis method had the greater advantage in reliability. The relatively higher levels following the model-based approach could also be achieved using other approaches.

For RQ2, we focused on comparing the correlations computed from the trimmed data (using conventional strategies) and from model criticism. Model criticism as an outlier treatment strategy appeared to have improved the figures of both the Hui et al. (2022b) and the Buffington and Morgan-Short (2022) data sets. In the case of the Hui et al. data, the improvement was not as marked, potentially due to the already high level of correlations (.81 and .85). No advantage was observed for the Fang and Wu data.

Discussion

In our analysis of the three open data sets, we found that the model-based approach was useful to varying degrees. Overall, the model-based approach only showed a clear advantage for the Hui et al. (2022b) data set. This was, at least initially, surprising to us. Here, we first discuss why the model-based approach demonstrated exceptional performance with this data set. Then, we consider why this advantage did not generalize to the other two data sets, which motivated our simulation study that follows.

Repeatedly, we have stressed that mixed-effects models can simultaneously model variability caused by characteristics of both individual items and participants (Baayen et al., 2008). Therefore, the benefits of the mixed-effects approach should be most obvious when there is a reasonable amount of participant and item variability to be accounted for. In the Hui et al. (2022b) data, the authors used masked-repetition priming as a vocabulary test. Therefore, they sampled word stimuli across four frequency bands. Put differently, the words included in the experiment were of various levels of difficulty. Highly frequent words (e.g., upset) should elicit much faster responses than items at a lower frequency band (e.g., miniature). This means that the intercept terms, representing the baseline RT for each item, can vary to a large extent. Relatedly, the level of priming might also vary because there is not a lot of room for a drastic speedup if the baseline RT is already fast. Therefore, properly modeling item variability (in addition to participant variability) proved to be the right strategy because the accuracy of the estimates was improved through a phenomenon known as shrinkage or regularization, a primary property of mixed-effects models (Baayen et al., 2008; Winter, 2019). More accurate estimates also mean that the person-related parameters (e.g., baseline RTs [intercepts] and priming effects [slopes]) are closer to their true values; therefore, the improvement in split-half reliability was especially obvious for this data set.

In contrast, the Buffington and Morgan-Short (2022) and Fang and Wu (2022a) data sets did not have a large amount of item variability. In Buffington and Morgan-Short (2022), there was little variability in terms of difficulty because there were consistently four positions in which the dog head could appear. The same was true for the study by Fang and Wu (2022b), who focused on only one grammatical structure (i.e., the either–or construction). Perhaps, when there is little item variability to be accounted for in the first place, the design of the task might prevent the mixed-effects models from achieving their full potential. In addition, the number of items might have played a role. In the Buffington and Morgan-Short data, there were 800 items (or 1,600 trials). Aggregating across such a large number of items yields very accurate results, as manifested by the higher levels of reliability in the by-participant analysis than in the by-item analysis (see Table 3). Therefore, the aggregation-and-subtraction approaches were not disadvantaged by the sample size of the items. Finally, the overall level of error should be a determining factor because the model-based approach is not a magic wand that can remove all error and make an unreliable measure suddenly reliable. What mixed-effects models are capable of is properly partitioning variance, thereby accounting for variance due to participant and item that would otherwise have been regarded as error variance contributing to the unreliability of the measure. However, they cannot remove random error from the data set.

In light of these accounts, we performed a simulation study to show the effects of (1) item number and (2) level of error on the usefulness of the model-based approach to estimating reliability for RT differences (RQ3).

Study 2: Simulations

In the previous analyses, we observed that only the Hui et al. (2022b) data benefited from a model-based approach to estimating reliability. The overall objective of this simulation study was to explore the strengths and limitations of the model-based approach. It is also through this study that we address our RQ3, which we repeat here for easier reference:

RQ3: Under what conditions, in terms of the number of items and the level of error, is a model-based approach more beneficial than non-model-based approaches in estimating the reliability of RT differences?

Methodology

Data simulation

Data simulation refers to creating artificial data sets for analysis. By definition, simulated data are not authentic in the sense that they are not collected from human participants. However, because researchers have the flexibility to determine the characteristics of the data in the data-generation process, and because a large number of data sets can be created in one go to simulate various scenarios, simulation has been a useful tool for methods research in fields such as psychology and the educational sciences. Because methods research in SLA is only just taking off, the use of simulation is not yet common. Among a small number of exceptions, power analysis appears to be one of the main uses of simulation (e.g., Brysbaert & Stevens, 2018; Nicklin & Vitta, 2021; Vitta et al., 2021). By manipulating parameters such as effect sizes in artificial data sets and then running the relevant statistical tests, these authors were able to provide sample-size guidelines for applied researchers to achieve sufficient levels of statistical power. In addition, data simulation has been used in methods education, allowing researchers to understand more fully how certain statistical models operate without having to deal with the uncertainty naturally embedded in real data sets from research studies (DeBruine & Barr, 2021). In the present analysis, we simulated data sets and then carried out reliability assessments in order to identify any “sweet spots” where the model-based approach would be most helpful in estimating reliability.

The simulation

We wrote our R code (available on our OSF page: https://osf.io/cd5r8/) based on DeBruine and Barr (2021) and an online tutorial provided by DeBruine (2020).

As a first step, we created a baseline data set whose parameters were set up according to the model summary reported by Hui et al. (2022b; see Table 4), because the model-based approach proved most useful for that data set in Study 1. In brief, we simulated a within-participant design in which participants (N = 120) responded to 40 items, each presented in a related and an unrelated trial. The mean difference between the related and unrelated versions was set at 0.10, given a grand mean of -1.50. With this simulated data set, we repeated the two reliability assessments, following the steps reported in Study 1 above: one based on RT differences computed from raw RTs and the other based on the model-based approach. We did not repeat the z-transformation procedure from Study 1 because it did not produce higher levels of reliability in any of the three open data sets analyzed.
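To make the data-generation step concrete, the following is a minimal sketch, in the spirit of DeBruine and Barr (2021), of how such a baseline data set can be simulated. The fixed effects match the values just described, but the random-effect SDs shown here are placeholder values rather than the exact parameters in Table 4, and the sketch omits the correlation between random intercepts and slopes; the full simulation code is available on our OSF page.

```r
# Minimal sketch (not the full OSF code): simulate one baseline data set with
# crossed participant and item random effects. Random-effect SDs are assumed
# placeholder values; see Table 4 for the parameters actually used.
set.seed(1)

n_subj <- 120    # participants
n_item <- 40     # items, each seen in a related and an unrelated trial
b0     <- -1.50  # grand mean
b1     <- 0.10   # mean condition difference
sd_u0  <- 0.30   # by-participant intercept SD (assumed)
sd_u1  <- 0.10   # by-participant condition-slope SD (assumed)
sd_w0  <- 0.20   # by-item intercept SD (assumed)
sd_e   <- 0.20   # residual SD (baseline level)

subjects <- data.frame(subj = 1:n_subj,
                       u0 = rnorm(n_subj, 0, sd_u0),
                       u1 = rnorm(n_subj, 0, sd_u1))
items <- data.frame(item = 1:n_item,
                    w0 = rnorm(n_item, 0, sd_w0))

dat <- expand.grid(subj = 1:n_subj, item = 1:n_item,
                   condition = c(-0.5, 0.5))   # deviation-coded prime condition
dat <- merge(merge(dat, subjects, by = "subj"), items, by = "item")
dat$rt <- with(dat, b0 + u0 + w0 + (b1 + u1) * condition) +
  rnorm(nrow(dat), 0, sd_e)
```

With deviation coding (-0.5 vs. +0.5), b1 corresponds directly to the 0.10 condition difference.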

Table 4. Parameters for the baseline data set

To address RQ3, we tested the effects of varying two parameters, the number of items and the degree of error, while keeping all other specifications constant, and we then compared the reliability estimates produced by the two approaches. For the number of items, we tested eight levels (i.e., k = 20, 30, 40 [baseline], 80, 120, 160, 400, and 600). For the degree of error, we tested seven levels (i.e., residual SD = 0.05, 0.10, 0.20 [baseline], 0.40, 0.60, 0.80, and 1.00).
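To illustrate the two assessments, the sketch below applies both to a simulated data set like the one above (dat, with columns subj, item, condition, and rt). The odd/even item split and the absence of a Spearman–Brown correction are simplifying assumptions made here for brevity; the exact splitting procedure follows the Study 1 steps and is documented in the code on our OSF page.

```r
# Sketch of the two split-half reliability estimates (simplified: odd/even item
# split assumed, no Spearman-Brown correction applied).
library(lme4)

dat$half <- ifelse(dat$item %% 2 == 0, "even", "odd")

## (a) Raw-RT approach: average each participant's RTs by condition within each
##     half, subtract, and correlate the difference scores across halves.
means <- aggregate(rt ~ subj + half + condition, data = dat, FUN = mean)
wide  <- reshape(means, idvar = c("subj", "half"),
                 timevar = "condition", direction = "wide")
wide$diff <- wide[["rt.0.5"]] - wide[["rt.-0.5"]]   # per-participant condition difference
halves <- reshape(wide[, c("subj", "half", "diff")],
                  idvar = "subj", timevar = "half", direction = "wide")
cor(halves$diff.odd, halves$diff.even)

## (b) Model-based approach: fit a mixed-effects model to each half and
##     correlate the by-participant random slopes for condition across halves.
## Because every participant contributes to both halves, the rows returned by
## ranef() line up across the two models.
slopes <- lapply(split(dat, dat$half), function(d) {
  m <- lmer(rt ~ condition + (condition | subj) + (1 | item), data = d)
  ranef(m)$subj[, "condition"]
})
cor(slopes$odd, slopes$even)
```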

Results

We present and visualize the correlation coefficients and their 95% confidence intervals in Tables 5 and 6 and in Figures 2 and 3. Overall, increasing the number of items improved the reliability of RT differences based on raw RTs. The model-based approach reached its ceiling level with 30 to 40 items, whereas the raw RT approach needed up to 400 items to reach acceptable levels. As for the degree of error, increasing the level of noise in the data caused reliability to drop, as expected. However, the drop was more drastic for the raw RT approach than for the model-based approach: when we doubled the level of error (i.e., from 0.20 to 0.40), the split-half correlation for the raw RT approach essentially dropped to the floor at .02, whereas the model-based estimate remained at a satisfactory .82. A further increase in error, however, led to a steep decline even for the model-based approach.

Table 5. Split-half correlations for simulated data sets (varying the numbers of stimuli)

Table 6. Split-half correlations for simulated data sets (varying the degrees of error)

Figure 2. Correlations and their confidence intervals for two estimation approaches varying the number of items.

Figure 3. Correlations and their confidence intervals for two estimation approaches varying the level of error.

Discussion

Through a series of simulations, we have shown that the model-based approach can be a promising computational alternative for estimating the reliability of RT-difference measures. It appears to be particularly useful when researchers cannot include a very large number of items (e.g., k = 400 or more) and when the level of error is only moderate. Conversely, the raw RT approach can also be useful when the number of items reaches a certain threshold and when the error standard deviation (i.e., the level of noise) is small enough. Under those circumstances, averaging raw RT differences is powerful enough that a model-based approach becomes unnecessary. In other, more extreme conditions, such as when the error level is very high, neither approach provides a good rescue. Taken together, the simulation confirms our intuition that there is a “sweet spot” where the model-based approach is most useful: it is advisable to apply a model-based approach when researchers cannot include a large number of items and the level of error is only moderate.

General discussion and conclusion

In this article, we took a step toward a more informed, potentially standard computational method for estimating the reliability of RT differences, a type of measure that SLA researchers are beginning to rely on. We carried out two sets of analyses using both open, authentic (Study 1) and artificial, simulated (Study 2) data sets. The results of Study 1 demonstrated that model-based approaches can be very powerful in yielding a high level of reliability compared with methods based on raw or z-transformed RT differences, but only in certain cases, and that adding model criticism did not yield more reliable results. In Study 2, we highlighted situations where the model-based approach can be useful, but we also demonstrated that this approach is not a magic wand that can remove error from an unreliable measure. We therefore argue that the model-based approach presented in this article is an alternative that researchers should consider, especially under the circumstances identified above. As in any data analysis, there is almost always more than one acceptable procedure (see, e.g., Steegen et al., 2016). For example, Suzuki et al. (2022) reported Cronbach’s alpha for their word-monitoring measure (as a grammatical sensitivity index). Researchers have also examined the use of a model-based Bayesian approach with RT-based tasks (Haines et al., 2020). The present analysis remains silent on how our approach, based on mixed-effects modeling, would compare with theirs.

When the model-based approach yields a good level of reliability, researchers should use it. If the RT difference is the outcome variable, this simply means engaging in mixed-effects modeling of the trial-level RTs, as researchers probably would anyway. If, on the other hand, the RT difference sits on the predictor side of the equation as an individual-difference measure, we suggest that researchers first model the RT data with a mixed-effects model and then use the by-participant random slopes for condition as a predictor and/or allow them to interact with other variables (see the sketch below). In work involving structural equation modeling, such as the measurement of explicit and implicit knowledge, the RT-difference measure may serve as an indicator of a common factor. Using the random slopes in these cases is also appropriate because of their relatively high reliability (although measurement error is already taken into account in building a common factor model).
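As a minimal sketch of this workflow (not the code of any particular study), the example below extracts the by-participant random slopes from a fitted model and merges them into a participant-level data set; dat stands for a trial-level data set with columns rt, condition, subj, and item (as in the simulation sketches above), and learner_data and outcome are hypothetical names.

```r
# Sketch: by-participant condition slopes as an individual-difference score.
# Names other than rt, condition, subj, and item are illustrative only.
library(lme4)

m <- lmer(rt ~ condition + (condition | subj) + (1 | item), data = dat)

# ranef() gives each participant's deviation from the fixed condition effect;
# adding the fixed effect (as coef() does) would shift but not reorder them.
sensitivity <- data.frame(subj  = rownames(ranef(m)$subj),
                          slope = ranef(m)$subj[, "condition"])

# Hypothetical follow-up: merge with a participant-level data set and use the
# slope as a predictor of some outcome.
# learner_data <- merge(learner_data, sensitivity, by = "subj")
# summary(lm(outcome ~ slope, data = learner_data))
```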

However, we must stress that researchers should treat study design as the first line of defense against potential unreliability. For example, to the extent that resources allow, researchers may consider including a greater number of items. Brysbaert and Stevens (2018), for instance, recommended having 1,600 observations (e.g., 40 participants times 40 items) to run a well-powered study. In sample-size planning, many SLA researchers focus almost entirely on the interplay between the number of participants, expected effect sizes, significance levels, and statistical power (see, e.g., Loewen & Hui, 2021). In fact, the number of items (as shown in Study 2) and the reliability of the instrument (discussed in the introductory sections) should receive more attention. In addition to the number of items, there should ideally be sufficient variability in item difficulty to differentiate more and less able participants. This can be achieved by including more grammatical structures or words from different frequency bands. For example, Suzuki et al. (2022) and Godfroid and Kim (2021) included four and six grammatical structures in their tests, respectively, and Hui et al. (2022b) included vocabulary items from four frequency bands. All these design features can help researchers reach a satisfactory level of reliability.

More generally, we echo previous calls for more consistent reporting of instrument reliability (Marsden et al., 2018; Plonsky & Derrick, 2016). In the present case of RT-difference measures, researchers can report the split-half reliability based on the by-participant random slopes for the experimental conditions. This information is critical for understanding the uncertainty surrounding the measurement and the limitations of the instrument. Consumers of research need it to evaluate claims made in studies and/or to determine whether the instrument is of sufficient quality to be adopted in subsequent research. As L2 researchers increasingly embrace open science practices, more materials are being shared publicly in repositories such as IRIS (Marsden et al., 2016). Reliability information is one of the criteria researchers can draw on to make an informed assessment of the quality of an instrument.

In relation to adopting materials, researchers should also share trial-level data (e.g., Hui et al., 2023; Isbell, 2021; Marsden & Morgan-Short, 2023). Although the overall reliability is informative, subsequent researchers are in a position to further improve an instrument in their own studies through an item analysis of the shared data. For example, specific items may fail to elicit the intended effects (e.g., priming) because of some overlooked item characteristics. Such items can create noise in the data that is difficult to identify by simply inspecting the overall reliability level. Access to trial-level data allows subsequent researchers to revisit and inspect the items more thoroughly, for example by examining the by-item random slopes (as sketched below), to identify potentially questionable items. They can then revise or remove those items, making the instrument a better version for subsequent work. We would like to mention that our search for appropriate data sets to analyze in Study 1 was not straightforward; for example, we encountered studies that had been awarded an Open Data badge but did not include trial-level data. Despite these challenges, we are very thankful to the researchers who do share their data because they have made more methods research possible.
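As an illustration of such an item analysis, the sketch below fits a model with by-item random slopes for condition and flags items whose estimated effect deviates markedly from the average. The 2-SD cut-off is an arbitrary illustration rather than a recommendation, and dat again stands for a trial-level data set with rt, condition, subj, and item columns.

```r
# Sketch: inspecting by-item random slopes to flag potentially questionable
# items (e.g., items that do not show the intended priming effect).
library(lme4)

m_items <- lmer(rt ~ condition + (condition | subj) + (condition | item),
                data = dat)

item_slopes <- ranef(m_items)$item[, "condition"]
names(item_slopes) <- rownames(ranef(m_items)$item)

# Flag items whose condition effect deviates from the average by more than
# 2 SDs of the by-item slope distribution (an arbitrary illustrative cut-off).
flagged <- names(item_slopes)[abs(item_slopes) > 2 * sd(item_slopes)]
flagged
```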

Moreover, we join previous calls for the sharing of reproducible analysis code (Hui & Huntley, 2021; In’nami et al., 2022). Analysis code reveals the procedure one has undertaken to arrive at the reported reliability. As discussed in the introductory sections, the specific steps in estimating the reliability of RT differences can seem a mystery because the computation is often not detailed and there is no standard practice. In such cases, many applied researchers would benefit from knowing how reliability was computed in previous studies. That is not to say that researchers should blindly follow what others have done. On the contrary, analysis code provides methodologists with a way to understand current practices, a first step toward refining them.

Finally, although the scope of this study is limited to reliability, we would like to raise researchers’ awareness of validity issues because the ultimate goal is to have measures that are both reliable and valid. Reliability is a necessary, but not a sufficient, condition for a valid measure. Even when there are ways to promote reliability, a task can still be reliably measuring an irrelevant construct. If that turns out to be the case, SLA researchers might want to consider redesigning existing measures and/or developing new ones intended specifically for individual differences research (Burgoyne et al., 2022; Draheim et al., 2022; Draheim et al., 2021; Weigard et al., 2021).

To conclude, we have demonstrated that mixed-effects modeling can provide an alternative for estimating the reliability of RT differences more accurately and that model criticism as an outlier treatment strategy does not necessarily result in better reliability and thus may not be necessary for this purpose. Although the model-based approach performs well only under specific conditions and cannot magically remove error from a measure, it represents a promising option for researchers using RT differences in their work. We also hope that this article raises researchers’ awareness of the need to continually evaluate the psychometric properties of the measures we rely on.

Competing interest

We have no known conflict of interest to disclose.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. https://www.testingstandards.net/uploads/7/6/6/4/76643089/standards_2014edition.pdf
Arnon, I. (2019). Do current statistical learning tasks capture stable individual differences in children? An investigation of task reliability across modalities. Behavioral Research Methods, 52, 68–81. https://doi.org/10.3758/s13428-019-01205-5
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390–412. https://doi.org/10.1016/j.jml.2007.12.005
Baayen, R. H., & Milin, P. (2010). Analyzing reaction times. International Journal of Psychological Research, 3, 12–28. https://doi.org/10.21500/20112084.807
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278. https://doi.org/10.1016/j.jml.2012.11.001
Bowles, M. A. (2011). Measuring implicit and explicit linguistic knowledge: What can heritage language learners contribute? Studies in Second Language Acquisition, 33, 247–271. https://doi.org/10.1017/S0272263110000756
Brill-Schuetz, K., & Morgan-Short, K. (2014). The role of procedural memory in adult second language acquisition. Proceedings of the Annual Meeting of the Cognitive Science Society, 36, 260–265. https://escholarship.org/content/qt0dc7958r/qt0dc7958r.pdf
Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal of Cognition, 1, Article 9. https://doi.org/10.5334/joc.10
Buffington, J., Demos, A. P., & Morgan-Short, K. (2021). The reliability and validity of procedural memory assessments used in second language acquisition research. Studies in Second Language Acquisition, 43, 635–662. https://doi.org/10.1017/S0272263121000127
Buffington, J., & Morgan-Short, K. (2022). Data and analysis: The reliability and validity of procedural memory assessments used in second language acquisition research []. Open Science Framework. https://osf.io/ux4qs/
Burgoyne, A. P., Mashburn, C. A., Tsukahara, J. S., & Engle, R. W. (2022). Attention control and process overlap theory: Searching for cognitive processes underpinning the positive manifold. Intelligence, 91, Article 101629. https://doi.org/10.1016/j.intell.2022.101629
Chung, Y., Rabe-Hesketh, S., Dorie, V., Gelman, A., & Liu, J. (2013). A nondegenerate penalized likelihood estimator for variance parameters in multilevel models. Psychometrika, 78, 685–709. https://doi.org/10.1007/s11336-013-9328-2
Clark, K., Birch-Hurst, K., Pennington, C. R., Petrie, A. C., Lee, J. T., & Hedge, C. (2022). Test-retest reliability for common tasks in vision science. Journal of Vision, 22, Article 18. https://doi.org/10.1167/jov.22.8.18
Cohen, A. D., & Macaro, E. (2013). Research methods in second language acquisition. In Macaro, E. (Ed.), Continuum companion to second language acquisition (pp. 107–136). Continuum.
Cronbach, L. J., & Furby, L. (1970). How we should measure “change”: Or should we? Psychological Bulletin, 74, 68–80. https://doi.org/10.1037/h0029382
Davis, K. A. (1992). Validity and reliability in qualitative research on second language acquisition and teaching: Another researcher comments. TESOL Quarterly, 26, 605–608. https://doi.org/10.2307/3587190
DeBruine, L. M. (2020). Simulating mixed effects. https://debruine.github.io/tutorials/sim-lmer.html
DeBruine, L. M., & Barr, D. J. (2021). Understanding mixed-effects models through data simulation. Advances in Methods and Practices in Psychological Science, 4, Article 251524592096511. https://doi.org/10.1177/2515245920965119
Draheim, C., Mashburn, C. A., Martin, J. D., & Engle, R. W. (2019). Reaction time in differential and developmental research: A review and commentary on the problems and alternatives. Psychological Bulletin, 145, 508–535. https://doi.org/10.1037/bul0000192
Draheim, C., Tsukahara, J. S., & Engle, R. W. (2022). Replication and extension of the toolbox approach to measuring attention control. PsyArXiv. https://doi.org/10.31234/osf.io/gbnzh
Draheim, C., Tsukahara, J. S., Martin, J. D., Mashburn, C. A., & Engle, R. W. (2021). A toolbox approach to improving the measurement of attention control. Journal of Experimental Psychology: General, 150, 242–275. https://doi.org/10.1037/xge0000783
Elgort, I. (2011). Deliberate learning and vocabulary acquisition in a second language. Language Learning, 61, 367–413. https://doi.org/10.1111/j.1467-9922.2010.00613.x
Elgort, I., Brysbaert, M., Stevens, M., & Van Assche, E. (2018). Contextual word learning during reading in a second language: An eye-movement study. Studies in Second Language Acquisition, 40, 341–366. https://doi.org/10.1017/S0272263117000109
Ellis, R. (2005). Measuring implicit and explicit knowledge of a second language: A psychometric study. Studies in Second Language Acquisition, 27, 141–172. https://doi.org/10.1017/S0272263105050096
Fang, S., & Wu, Z. (2022a). L2 predictive processing [Data set]. https://osf.io/abhjv/
Fang, S., & Wu, Z. (2022b). Syntactic prediction in L2 learners: Evidence from English disjunction processing. International Review of Applied Linguistics in Language Teaching. https://doi.org/10.1515/iral-2021-0223
Faretta-Stutenberg, M., & Morgan-Short, K. (2018). The interplay of individual differences and context of learning in behavioral and neurocognitive second language development. Second Language Research, 34, 67–101. https://doi.org/10.1177/0267658316684903
Godfroid, A. (2020). Sensitive measures of vocabulary knowledge and processing. In Webb, S. (Ed.), The Routledge handbook of vocabulary studies (pp. 433–453). Routledge. https://doi.org/10.4324/9780429291586-28
Godfroid, A., & Kim, K. M. (2021). The contributions of implicit-statistical learning aptitude to implicit second-language knowledge. Studies in Second Language Acquisition, 43, 606–634. https://doi.org/10.1017/S0272263121000085
Granena, G. (2013). Individual differences in sequence learning ability and second language acquisition in early childhood and adulthood: Sequence learning ability and SLA. Language Learning, 63, 665–703. https://doi.org/10.1111/lang.12018
Gulliksen, H. (1950). Theory of mental tests. Wiley.
Haines, N., Kvam, P. D., Irving, L. H., Smith, C., Beauchaine, T. P., Pitt, M. A., Ahn, W.-Y., & Turner, B. M. (2020). Theoretically informed generative models can advance the psychological and brain sciences: Lessons from the reliability paradox. PsyArXiv. https://doi.org/10.31234/osf.io/xr7y3
Hedge, C., Powell, G., & Sumner, P. (2018). The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behavior Research Methods, 50, 1166–1186. https://doi.org/10.3758/s13428-017-0935-1
Hedge, C., Powell, G., Bompas, A., & Sumner, P. (2022). Strategy and processing speed eclipse individual differences in control ability in conflict tasks. Journal of Experimental Psychology: Learning, Memory, and Cognition, 48, 1448–1469. https://doi.org/10.1037/xlm0001028
Hox, J., Moerbeek, M., & van de Schoot, R. (2018). Multilevel analysis: Techniques and applications. Routledge.
Hui, B., Godfroid, A., & Elgort, I. (2022a). A construct validation study of time-sensitive word measures. Open Science Framework. https://doi.org/10.31219/osf.io/dwjmn
Hui, B., Godfroid, A., & Elgort, I. (2022b). Materials, data, and analysis code for the paper: A construct validation study of time-sensitive word measures []. Open Science Framework. https://osf.io/uyfh5/
Hui, B., & Huntley, E. (2021). Promoting open science practices: What can (should) graduate programs do? In Plonsky, L. (Ed.), Open science in applied linguistics. John Benjamins.
Hui, B., Koh, J., & Ogawa, S. (2023). Voices of three junior scholars: A commentary on “(Why) are open research practices the future for the study of language learning?”. Language Learning. Advance online publication. https://doi.org/10.1111/lang.12571
Hutchison, K. A., Balota, D. A., Cortese, M. J., & Watson, J. M. (2008). Predicting semantic priming at the item level. Quarterly Journal of Experimental Psychology, 61, 1036–1066. https://doi.org/10.1080/17470210701438111
In’nami, Y., Mizumoto, A., Plonsky, L., & Koizumi, R. (2022). Promoting computationally reproducible research in applied linguistics: Recommended practices and considerations. Research Methods in Applied Linguistics, 1, Article 100030. https://doi.org/10.1016/j.rmal.2022.100030
Isbell, D. R. (2021). Open science, data analysis, and data sharing. Open Science Framework. https://doi.org/10.31219/osf.io/pdj9y
Krus, D. J., & Helmstadter, G. C. (1993). The problem of negative reliabilities. Educational and Psychological Measurement, 53, 643–650. https://doi.org/10.1177/0013164493053003005
Lammertink, I., Boersma, P., Wijnen, F., & Rispens, J. (2020). Statistical learning in the visuomotor domain and its relation to grammatical proficiency in children with and without developmental language disorder: A conceptual replication and meta-analysis. Language Learning and Development, 16, 426–450. https://doi.org/10.1080/15475441.2020.1820340
Loewen, S., & Hui, B. (2021). Small samples in instructed second language acquisition research. The Modern Language Journal, 105, 187–193. https://doi.org/10.1111/modl.12700
Maie, R. (2022). Testing the three-stage model of second language skill acquisition (Publication No. 2748315725) [Doctoral dissertation, Michigan State University]. ProQuest Dissertations & Theses Global. https://www.proquest.com/dissertations-theses/testing-three-stage-model-second-language-skill/docview/2748315725/se-2
Maie, R., & DeKeyser, R. M. (2020). Conflicting evidence of explicit and implicit knowledge from objective and subjective measures. Studies in Second Language Acquisition, 42, 359–382. https://doi.org/10.1017/S0272263119000615
Mair, P., & Wilcox, R. (2020). Robust statistical methods in R using the WRS2 package. Behavior Research Methods, 52, 464–488.
Marsden, E., Mackey, A., & Plonsky, L. (2016). The IRIS Repository: Advancing research practice and methodology. In Mackey, A. & Marsden, E. (Eds.), Advancing methodology and practice: The IRIS repository of instruments for research into second languages (pp. 1–21). Routledge.
Marsden, E., & Morgan-Short, K. (2023). (Why) are open research practices the future for the study of language learning? Language Learning. Advance online publication. https://doi.org/10.1111/lang.12568
Marsden, E., Thompson, S., & Plonsky, L. (2018). A methodological synthesis of self-paced reading in second language research. Applied Psycholinguistics, 39, 861–904. https://doi.org/10.1017/S0142716418000036
McKay, T., & Plonsky, L. (2021). Reliability analyses: Estimating error. In Winke, P. M. & Brunfaut, T. (Eds.), The Routledge handbook of second language acquisition and language testing (pp. 468–482). Routledge.
Miller, J., & Ulrich, R. (2013). Mental chronometry and individual differences: Modeling reliabilities and correlations of reaction time means and effect sizes. Psychonomic Bulletin & Review, 20, 819–858. https://doi.org/10.3758/s13423-013-0404-5
Nash, J., & Varadhan, R. (2011). Unifying optimization algorithms to aid software system users: Optimx for R. Journal of Statistical Software, 43, 1–14. https://doi.org/10.18637/jss.v043.i09
Nicklin, C., & Vitta, J. P. (2021). Effect-driven sample sizes in second language instructed vocabulary acquisition research. The Modern Language Journal, 105, 218–236. https://doi.org/10.1111/modl.12692
Oliveira, C. M., Hayiou-Thomas, M. E., & Henderson, L. (2022). Reliability of the serial reaction time task: If at first you don’t succeed, try try try again. PsyArXiv. https://doi.org/10.31234/osf.io/hqmy7
Paap, K. R., Johnson, H. A., & Sawi, O. (2016). Should the search for bilingual advantages in executive functioning continue? Cortex, 74, 305–314. https://doi.org/10.1016/j.cortex.2015.09.010
Parsons, S., Kruijt, A. W., & Fox, E. (2019). Psychological science needs a standard practice of reporting the reliability of cognitive-behavioral measurements. Advances in Methods and Practices in Psychological Science, 2, 378–395. https://doi.org/10.1177/2515245919879695
Patterson, A. S., & Nicklin, C. (2023). L2 self-paced reading data collection across three contexts: In-person, online, and crowdsourcing. Research Methods in Applied Linguistics, 2, Article 100045. https://doi.org/10.1016/j.rmal.2023.100045
Plonsky, L., & Derrick, D. J. (2016). A meta-analysis of reliability coefficients in second language research. The Modern Language Journal, 100, 538–553. https://doi.org/10.1111/modl.12335
Plonsky, L., Marsden, E., Crowther, D., Gass, S. M., & Spinner, P. (2020). A methodological synthesis and meta-analysis of judgment tasks in second language research. Second Language Research, 36, 583–621. https://doi.org/10.1177/0267658319828413
R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Rouder, J. N., de la Pena, A. C., Pratte, M., Richards, V., Hernan, M., Pascoe, M., & Thapar, A. (2022). Is the antisaccade task a unicorn task for measuring cognitive control? Open Science Framework. https://doi.org/10.31219/osf.io/fhg3n
Rouder, J. N., & Haaf, J. M. (2019). A psychometrics of individual differences in experimental tasks. Psychonomic Bulletin and Review, 26, 452–467. https://doi.org/10.3758/s13423-018-1558-y
Siegelman, N., Bogaerts, L., & Frost, R. (2017). Measuring individual differences in statistical learning: Current pitfalls and possible solutions. Behavioral Research, 49, 418–432. https://doi.org/10.3758/s13428-016-0719-z
Siyanova-Chanturia, A. (2013). Eye-tracking and ERPs in multi-word expression research: A state-of-the-art review of the method and findings. The Mental Lexicon, 8, 245–268. https://doi.org/10.1075/ml.8.2.06siy
Siyanova-Chanturia, A., & Martinez, R. (2014). The idiom principle revisited. Applied Linguistics, 36, 549–569. https://doi.org/10.1093/applin/amt054
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11, 702–712. https://doi.org/10.1177/1745691616658637
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680. https://doi.org/10.1126/science.103.2684.677
Suzuki, Y. (2017). Validity of new measures of implicit knowledge: Distinguishing implicit knowledge from automatized explicit knowledge. Applied Psycholinguistics, 38, 1229–1261. https://doi.org/10.1017/S014271641700011X
Suzuki, Y., Jeong, H., Cui, H., Okamoto, K., Kawashima, R., & Sugiura, M. (2022). An fMRI validation study of the word-monitoring task as a measure of implicit knowledge: Exploring the role of explicit and implicit aptitudes in behavioral and neural processing. Studies in Second Language Acquisition, 45, 109–136. https://doi.org/10.1017/S0272263122000043
Tan, L. C., & Yap, M. J. (2016). Are individual differences in masked repetition and semantic priming reliable? Visual Cognition, 24, 182–200. https://doi.org/10.1080/13506285.2016.1214201
Trofimovich, P., & McDonough, K. (Eds.). (2011). Applying priming methods to L2 learning, teaching and research: Insights from psycholinguistics. John Benjamins.
Vafaee, P., Suzuki, Y., & Kachisnke, I. (2017). Validating grammaticality judgment tests: Evidence from two new psycholinguistic measures. Studies in Second Language Acquisition, 39, 59–95. https://doi.org/10.1017/S0272263115000455
Verhaeghen, P., & De Meersman, L. (1998). Aging and the negative priming effect: A meta-analysis. Psychology and Aging, 13, 435–444. https://doi.org/10.1037/0882-7974.13.3.435
Vitta, J. P., Nicklin, C., & McLean, S. (2021). Effect size-driven sample-size planning, randomization, and multisite use in L2: Instructed vocabulary acquisition experimental samples. Studies in Second Language Acquisition, 44, 1424–1448. https://doi.org/10.1017/S0272263121000541
Weigard, A., Clark, D. A., & Sripada, C. (2021). Cognitive efficiency beats top-down control as a reliable individual difference dimension relevant to self-control. Cognition, 215, Article 104818. https://doi.org/10.1016/j.cognition.2021.104818
West, G., Shanks, D., & Hulme, C. (2018). Sustained attention, not procedural learning, is a predictor of reading, language and arithmetic skills in children. Scientific Studies of Reading, 25, 47–63. https://doi.org/10.1080/10888438.2020.1750618
West, G., Vadillo, M. A., Shanks, D. R., & Hulme, C. (2017). The procedural learning deficit hypothesis of language learning disorders: We see some problems. Developmental Science, 21, 1–13. https://doi.org/10.1111/desc.12552
Winter, B. (2019). Statistics for linguists: An introduction using R (1st ed.). Routledge. https://doi.org/10.4324/9781315165547