
Don't let perfect be the enemy of better: In defense of unparameterized megastudies

Published online by Cambridge University Press: 05 February 2024

Wei Li
Affiliation:
Department of Psychology & Neuroscience, Boston College, Chestnut Hill, MA, USA. liacz@bc.edu, https://www.weiiir.xyz/
Joshua K. Hartshorne*
Affiliation:
Department of Psychology & Neuroscience, Boston College, Chestnut Hill, MA, USA. joshua.hartshorne@hey.com, https://l3atbc.org
*Corresponding author.

Abstract

The target article argues researchers should be more ambitious, designing studies that systematically and comprehensively explore the space of possible experiments in one fell swoop. We argue that while “systematic” is rarely achievable, “comprehensive” is often enough. Critically, the recent popularization of massive online experiments shows that comprehensive studies are achievable for most cognitive and behavioral research questions.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2024. Published by Cambridge University Press

Almaatouq et al. provide an incisive and welcome critique of the dominant one-at-a-time paradigm. They argue for integrative studies that systematically and comprehensively explore the "universe of possible experiments" (target article, sect. 2.2, para. 1) in each domain of inquiry. While we are sympathetic to the goal, Almaatouq et al. overemphasize "systematic" at the expense of "comprehensive."

As we see it, the core problem with the one-at-a-time approach is that it is too slow. It is not news that most studies extrapolate broadly from a minuscule sample of stimuli, subject demographics, and experimental paradigms, resulting in a long-running generalizability crisis (Clark, 1973; Henrich, Heine, & Norenzayan, 2010; Judd, Westfall, & Kenny, 2012; Yarkoni, 2022). Even large literatures often fail to do much more than scratch the surface of possibilities (e.g., Hartshorne & Snedeker, 2013; Peterson, Bourgin, Agrawal, Reichman, & Griffiths, 2021). It is as if we set out to explore the universe of possible experiments, but spent most of our time hanging out at the hotel pool.

In principle, Almaatouq et al.'s integrative experiment approach is ideal: Find the set of parameters that describe the universe of possible experiments and then survey it systematically. Unfortunately, this requires a better understanding of the phenomenon than we usually have. Indeed, the cognitive and behavioral sciences remain largely in Kuhn's preparadigmatic phase (Kuhn, 2012), characterized by conflicting and incommensurate theories, each with its own set of assumptions, methods, and observations.

Let us illustrate the difficulty with an easy-to-articulate question: Are children better at learning all aspects of syntax or just certain parts? Answering this question requires comparing how quickly older and younger learners acquire each component of syntax. The problem is that different theories propose radically different visions of what syntax is and how it might be subdivided. Theories differ on whether syntax is governed by large numbers of highly articulated, abstract rules that combine structurally simple words; by a small number of simple rules that combine internally complex words; or by superimposed, prototype-like patterns with no distinction between words and rules; among other possibilities (Chomsky, 2014; Goldberg, 1995, 2009; Hopcroft, Motwani, & Ullman, 2001; Steedman, 2001). Even where theorists agree on the structure, they disagree on processing, with the same grammatical patterns subserved by different cognitive/neurological systems at different times or by different people (O'Donnell, 2015; Ullman, 2001).

To make matters worse, none of these are complete theories that can be applied to arbitrary stimuli. For starters, predictions depend on ancillary assumptions that are left as open empirical questions (such as the relative preference for generalizations vs. one-off computations). More problematically, different theories often prioritize explaining distinct phenomena (common or rare utterances; highly productive patterns or semi-idiomatic expressions; early child language or mature usage; similarities across languages or cross-linguistic differences); one theory may not make clear predictions about the core motivating phenomena for another, and vice versa. While we believe this reflects the difficulty of the problem rather than any lack of diligence on the part of researchers, the outcome is the same: A great deal of theoretical progress lies between here and the integrative experiments proposed by Almaatouq et al.

Even our example above underestimates the problem. Almaatouq et al. focus primarily on sampling from a stimulus space, not a task space. Two of their examples (trolley problems and risky-choice scenarios) are narrowly defined paradigms for studying much broader phenomena (moral reasoning and decision making under uncertainty). Their third example (masked cueing) does involve manipulating some task parameters beyond the stimuli themselves, but remains tied to a narrowly circumscribed task.

This might be fine if we fully understood the relationship between tasks and the underlying cognitive processes, but mostly we do not. Consider, for instance, measures of cognitive control, itself one of the most thoroughly investigated constructs in cognitive psychology. A number of popular tasks are used to study cognitive control, including the masked cueing paradigm described by Almaatouq et al. Recently, one of us directly compared cognitive control as measured by three closely related tasks: the Simon, Stroop, and flanker tasks (Erb, Germine, & Hartshorne, 2023). Two massive online experiments with more than 20,000 participants revealed that these three tasks show strikingly different patterns of change in performance over the lifespan and near-zero correlations with one another. Thus, integrative studies of cognitive control need to sample not just across stimuli but also across paradigms. However, it is not clear that the differences across paradigms/tasks can be easily parameterized. Indeed, advances in our fields often owe themselves to the creation of new paradigms that open up new questions or comparisons.

Perhaps we are too pessimistic. Perhaps most questions resemble trolley problems and few resemble syntax or cognitive control (though we note that one of Almaatouq et al.'s three examples actually investigated cognitive control). But we would hate to predicate moving beyond the one-at-a-time approach on the widespread feasibility of parameterization. We worry that this licenses researchers (and editors and funders) to let perfect be the enemy of better – because we can do much better.

In particular, Almaatouq et al. may have been able to find only three examples of systematic exploration of the universe of possible experiments, but comprehensive explorations abound. These include megastudies that test large, diverse sets of stimuli (e.g., Breithaupt, Li, & Kruschke, 2022; Brysbaert, Stevens, Mandera, & Keuleers, 2016; De Deyne, Navarro, Perfors, Brysbaert, & Storms, 2019; Hartshorne, Bonial, & Palmer, 2014), broad subject demographics (e.g., Bleidorn et al., 2013; Hartshorne, Tenenbaum, & Pinker, 2018; Nosek, Banaji, & Greenwald, 2002; Riley et al., 2016; Soto, John, Gosling, & Potter, 2011; Spiers, Coutrot, & Hornberger, 2023), or a range of related tasks (e.g., Erb, Germine, & Hartshorne, 2023; Hampshire, Highfield, Parkin, & Owen, 2012). Even without systematic exploration, these studies have produced major theoretical discoveries. They have also been instrumental in identifying important Almaatouq et al.-style parameters for subsequent systematic exploration. Critically, as Almaatouq et al. explain, the technology exists to conduct megastudies for most cognitive and behavioral questions, typically at lower aggregate cost than the status quo (see also Gosling & Mason, 2015; Li, Germine, Mehr, Srinivasan, & Hartshorne, 2022; Long, Simson, Buxó-Lugo, Watson, & Mehr, 2023). In short, our critique is of the "yes, and" variety. Yes, conduct systematic integrative experiments when you can. And, when you cannot, conduct less systematic megastudies.

Financial support

The authors acknowledge funding from NSF 2229631 to J. K. H.

Competing interest

None.

References

Bleidorn, W., Klimstra, T. A., Denissen, J. J., Rentfrow, P. J., Potter, J., & Gosling, S. D. (2013). Personality maturation around the world: A cross-cultural examination of social-investment theory. Psychological Science, 24(12), 2530–2540.
Breithaupt, F., Li, B., & Kruschke, J. K. (2022). Serial reproduction of narratives preserves emotional appraisals. Cognition and Emotion, 36(4), 581–601.
Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016). How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant's age. Frontiers in Psychology, 7, 1116.
Chomsky, N. (2014). The minimalist program. MIT Press.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4), 335–359.
De Deyne, S., Navarro, D. J., Perfors, A., Brysbaert, M., & Storms, G. (2019). The "small world of words" English word association norms for over 12,000 cue words. Behavior Research Methods, 51, 987–1006.
Erb, C. D., Germine, L., & Hartshorne, J. K. (2023). Fractionating cognitive control: Congruency tasks reveal divergent developmental trajectories. Journal of Experimental Psychology: General. https://doi.org/10.1037/xge0001429
Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure. University of Chicago Press.
Goldberg, A. E. (2009). The nature of generalization in language. Cognitive Linguistics, 20(1), 93–127.
Gosling, S. D., & Mason, W. (2015). Internet research in psychology. Annual Review of Psychology, 66, 877–902.
Hampshire, A., Highfield, R. R., Parkin, B. L., & Owen, A. M. (2012). Fractionating human intelligence. Neuron, 76(6), 1225–1237.
Hartshorne, J. K., Bonial, C., & Palmer, M. (2014). The VerbCorner project: Findings from phase 1 of crowd-sourcing a semantic decomposition of verbs. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Vol. 2: Short Papers, pp. 397–402). Baltimore, MD, USA, 23–25 June 2014.
Hartshorne, J. K., & Snedeker, J. (2013). Verb argument structure predicts implicit causality: The advantages of finer-grained semantics. Language and Cognitive Processes, 28(10), 1474–1508.
Hartshorne, J. K., Tenenbaum, J. B., & Pinker, S. (2018). A critical period for second language acquisition: Evidence from 2/3 million English speakers. Cognition, 177, 263–277.
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33(2–3), 61–83.
Hopcroft, J. E., Motwani, R., & Ullman, J. D. (2001). Introduction to automata theory, languages, and computation. ACM SIGACT News, 32(1), 60–65.
Judd, C. M., Westfall, J., & Kenny, D. A. (2012). Treating stimuli as a random factor in social psychology: A new and comprehensive solution to a pervasive but largely ignored problem. Journal of Personality and Social Psychology, 103(1), 54.
Kuhn, T. S. (2012). The structure of scientific revolutions. University of Chicago Press.
Li, W., Germine, L. T., Mehr, S. A., Srinivasan, M., & Hartshorne, J. K. (2022). Developmental psychologists should adopt citizen science to improve generalization and reproducibility. Infant and Child Development, e2348.
Long, B., Simson, J., Buxó-Lugo, A., Watson, D. G., & Mehr, S. A. (2023). How games can make behavioural science better. Nature, 613(7944), 433–436.
Nosek, B. A., Banaji, M. R., & Greenwald, A. G. (2002). Harvesting implicit group attitudes and beliefs from a demonstration web site. Group Dynamics: Theory, Research, and Practice, 6(1), 101.
O'Donnell, T. J. (2015). Productivity and reuse in language: A theory of linguistic computation and storage. MIT Press.
Peterson, J. C., Bourgin, D. D., Agrawal, M., Reichman, D., & Griffiths, T. L. (2021). Using large-scale experiments and machine learning to discover theories of human decision-making. Science, 372(6547), 1209–1214.
Riley, E., Okabe, H., Germine, L., Wilmer, J., Esterman, M., & DeGutis, J. (2016). Gender differences in sustained attentional control relate to gender inequality across countries. PLoS ONE, 11(11), e0165100.
Soto, C. J., John, O. P., Gosling, S. D., & Potter, J. (2011). Age differences in personality traits from 10 to 65: Big five domains and facets in a large cross-sectional sample. Journal of Personality and Social Psychology, 100(2), 330.
Spiers, H. J., Coutrot, A., & Hornberger, M. (2023). Explaining world-wide variation in navigation ability from millions of people: Citizen science project Sea Hero Quest. Topics in Cognitive Science, 15(1), 120–138.
Steedman, M. (2001). The syntactic process. MIT Press.
Ullman, M. T. (2001). The declarative/procedural model of lexicon and grammar. Journal of Psycholinguistic Research, 30, 37–69.
Yarkoni, T. (2022). The generalizability crisis. Behavioral and Brain Sciences, 45, e1.