Common ground
As vision scientists, we believe that an understanding of human visual processing should ultimately explain all visually driven behavior. Because vision operates – by definition – on visual input, a science of human vision ultimately requires “image-computable” models and theories that produce those models. Bowers et al. endorse this view, as every psychology experiment they suggest focuses on the effects of manipulating combinations of image pixels.
On empirical tests of vision models
As empirical vision scientists, we also believe that advances in understanding visual processing will arise from rigorous, community-transparent tests of model predictions against empirical observations from the brain (e.g., patterns of neural firing) and the mind (e.g., patterns of behavior). As such, we and others have contributed to the creation of an open-source platform where any member of the vision community can find the leading models, test new models, see the most model-disruptive experimental benchmarks, and add new benchmarks (www.brain-score.org; Schrimpf et al., 2018, 2020).
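To make concrete what testing a model against a behavioral benchmark involves, the sketch below scores a model's per-condition accuracies against the corresponding human pattern, normalized by the reliability (ceiling) of that pattern. The function name, numbers, and structure are our own illustrative placeholders, not the actual Brain-Score API.

```python
# Illustrative sketch of scoring a model on a behavioral benchmark
# (hypothetical names and numbers, not the actual Brain-Score API).
import numpy as np


def score_model(model_accuracy: np.ndarray,
                human_accuracy: np.ndarray,
                human_ceiling: float) -> float:
    """Correlate model and human per-condition accuracies, normalized by the human ceiling."""
    raw_consistency = np.corrcoef(model_accuracy, human_accuracy)[0, 1]
    return raw_consistency / human_ceiling


# Toy per-condition accuracies (e.g., accuracy on 8 stimulus conditions).
human = np.array([0.92, 0.85, 0.70, 0.64, 0.88, 0.55, 0.73, 0.81])
model = np.array([0.95, 0.80, 0.75, 0.60, 0.90, 0.50, 0.70, 0.85])

print(score_model(model, human, human_ceiling=0.95))  # ceiling-normalized consistency
```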
The most constructive contribution of Bowers et al. is the identification of a set of human behavioral vision findings that the authors believe will not be well-predicted by currently leading deep artificial neural network (ANN) models (target article, sect. 4.1). To evaluate this claim, the Brain-Score community is turning these empirical findings into accessible benchmarks that current (and future) models of human visual processing can be evaluated on. The results of this evaluation, especially if these benchmarks indeed present a challenge for current ANN models, should – and we expect will – motivate next steps in human vision modeling. We report the following status at the time of this writing:
• We have implemented a benchmark based on Baker and Elder (2022). We find that some ANN vision models score within the noise ceiling of the human data, with the ceiling estimated by resampling the human data (a sketch of this comparison appears after this list).
• Two of the papers (Puebla & Bowers, 2022; Zhang, Bengio, Hardt, Recht, & Vinyals, 2021) evaluate the performance of some ANN models without a human reference. Thus these studies currently provide no empirical support for the target article's claim that current ANN models fail to capture human behavior. But human data could be collected to turn these into benchmarks.
• Three of the papers (Bowers & Jones, 2007; Mack, Gauthier, Sadr, & Palmeri, 2008; Saarela, Sayim, Westheimer, & Herzog, 2009) produced human behavioral measures – for example, reaction times – that ANN models do not yet have a standardized way of making predictions about. This is surmountable (e.g., Spoerer, Kietzmann, Mehrer, Charest, & Kriegeskorte, 2020), and we view this as a goal for future models and for Brain-Score.
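As referenced in the first bullet above, here is a minimal sketch of how a noise ceiling can be estimated by resampling human data: subjects are repeatedly split into two halves, and the correlation between the halves' per-condition accuracies gives a distribution of human-to-human consistency against which a model's consistency with the pooled human data can be judged. The procedure, toy data, and threshold below are our own illustrative assumptions, not the exact implementation of the Baker and Elder (2022) benchmark.

```python
# Illustrative noise-ceiling estimate via split-half resampling of human data
# (our own sketch; the benchmark's exact procedure may differ).
import numpy as np

rng = np.random.default_rng(0)

# Toy data: per-subject, per-condition accuracy (20 subjects x 8 conditions).
human = rng.uniform(0.5, 1.0, size=(20, 8))


def split_half_ceiling(data: np.ndarray, n_splits: int = 1000) -> np.ndarray:
    """Distribution of correlations between condition-wise means of two random subject halves."""
    n_subjects = data.shape[0]
    ceilings = np.empty(n_splits)
    for i in range(n_splits):
        order = rng.permutation(n_subjects)
        half_a = data[order[: n_subjects // 2]].mean(axis=0)
        half_b = data[order[n_subjects // 2:]].mean(axis=0)
        ceilings[i] = np.corrcoef(half_a, half_b)[0, 1]
    return ceilings


ceiling = split_half_ceiling(human)
model_consistency = 0.78  # hypothetical model-to-human correlation
# "Within the noise ceiling" here means the model's consistency is not reliably
# below the human-to-human consistency distribution.
print(np.mean(ceiling), np.percentile(ceiling, [2.5, 97.5]))
print(model_consistency >= np.percentile(ceiling, 2.5))
```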
On current vision models
We are not dogmatically committed to any current deep ANN model of human vision; none is a perfect model of human vision, as the Brain-Score effort helped illuminate. However, we disagree with Bowers et al.'s claim that deep ANNs are not the currently leading models of human ventral visual processing. Bowers et al. critique ANN models without offering a better alternative: They imply that better models exist or should exist, but do not elaborate on what those models are. In the absence of an alternative model, it is justifiable to refer to ANNs as the currently best models. In fact, as can be seen on Brain-Score, some ANN models not only predict neural responses moderately well at multiple visual processing stages, but also predict, to some extent, even quite challenging behavioral data patterns (Geirhos et al., 2021; Rajalingham et al., 2018).
Bowers et al. eschew community-transparent suites of benchmarks, yet they imply an alternative notion of vision model evaluation that is somehow not a suite of benchmarks. But again, they do not offer a feasible alternative. Of course, the model rankings produced by benchmarks also depend on the choice of datasets and the metric used for evaluation. We will continue to help the Brain-Score community expand the range of datasets, and we are not dogmatically committed to any particular choice of metric. Different subcommunities may prefer to initially focus on different metrics (e.g., to identify the currently best behavioral model regardless of underlying brain alignment, or vice versa), and Brain-Score should support those different benchmark weightings. But we see no way to support advances in models of vision other than open, transparent, and community-driven model comparison.
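To illustrate how different benchmark weightings can change model rankings without changing any underlying scores, here is a minimal sketch with invented per-benchmark scores (not Brain-Score results): the same three hypothetical models are re-ranked under an equal weighting versus a behavior-only weighting.

```python
# Minimal sketch: re-ranking models under different benchmark weightings.
# All scores and weights below are invented for illustration (not Brain-Score results).
from typing import Dict, List

# Hypothetical per-benchmark scores for three models (higher is better).
scores: Dict[str, Dict[str, float]] = {
    "model_A": {"V1": 0.60, "IT": 0.55, "behavior": 0.40},
    "model_B": {"V1": 0.45, "IT": 0.50, "behavior": 0.60},
    "model_C": {"V1": 0.55, "IT": 0.60, "behavior": 0.50},
}


def rank(weights: Dict[str, float]) -> List[str]:
    """Rank models by a weighted mean of their benchmark scores."""
    total = sum(weights.values())
    aggregate = {
        model: sum(weights[b] * s[b] for b in weights) / total
        for model, s in scores.items()
    }
    return sorted(aggregate, key=aggregate.get, reverse=True)


print(rank({"V1": 1.0, "IT": 1.0, "behavior": 1.0}))  # equal weighting
print(rank({"V1": 0.0, "IT": 0.0, "behavior": 1.0}))  # behavior-only weighting
```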
On building new vision models
Bowers et al. appear to favor a classic approach in which a separate model is built for each psychological phenomenon, using specialized stimuli that are hand-crafted to enable certain visual features to be well-defined – for example, illusory contours or shape primitives. The appeal of this approach is that it reduces the complexity of a high-dimensional pixel input space into small, intuitive sets of features that enable the formulation and testing of conceptual hypotheses about vision – for example, the mechanisms of a particular class of visual illusions. However, because this approach requires dramatically restricting the stimuli under consideration, such hypotheses often cover a near-zero fraction of image space. In our opinion, the idea that a universal scientific model of human vision will result from sets of fragmented explanations that each engage only a tiny fraction of image space is illusory (Newell, 1973).
In contrast, the approach that we favor – starting with image-computable models – enables tangible progress toward a unified model of human vision. Transparent tracking of model shortcomings lights the path to this goal. We acknowledge that the image-computability requirement may make the formulation of traditional conceptual tests of a model more challenging. But it by no means makes such tests impossible. Any pattern of behavioral data, including those discussed in the target article, should be translatable into a behavioral benchmark on Brain-Score.
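As a sketch of what such a translation could look like, a behavioral benchmark needs little more than the stimuli, the measured human response pattern, a reliability ceiling, and a metric for comparing a model's responses to that pattern. The structure and names below are our own illustration, not Brain-Score's actual benchmark interface.

```python
# Illustrative skeleton of a behavioral benchmark: stimuli, human data, metric, ceiling.
# Names and structure are our own sketch, not Brain-Score's actual benchmark interface.
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np


@dataclass
class BehavioralBenchmark:
    stimulus_paths: Sequence[str]          # images shown to human subjects
    human_pattern: np.ndarray              # e.g., per-stimulus choice probabilities
    ceiling: float                         # reliability of the human pattern
    metric: Callable[[np.ndarray, np.ndarray], float]  # model-vs-human similarity

    def score(self, model_fn: Callable[[str], float]) -> float:
        """Run the model on each stimulus and compare its response pattern to the human pattern."""
        model_pattern = np.array([model_fn(p) for p in self.stimulus_paths])
        return self.metric(model_pattern, self.human_pattern) / self.ceiling
```

Given such a benchmark object, any image-computable model – wrapped as a function from a stimulus to a response – can be scored on the same footing as any other.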
Moving forward
Ultimately, we think that the advantages that image-computable models offer in enabling evaluation of predictions about diverse visual stimuli and phenomena heavily outweigh their disadvantages. And maintaining and expanding a common evaluation scheme for image-computable models of vision is, in our view, a prerequisite for channeling the valuable contributions of vision science – across neuroscience, cognitive science, psychology, and computer vision – toward convergence on the best scientific models of human vision. Let's move forward!
Acknowledgments
We thank Kohitij Kar, Michael Lee, Nancy Kanwisher, Nikolaus Kriegeskorte, and Chris Shay for helpful discussions and support.
Financial support
This work was supported in part by the Semiconductor Research Corporation (SRC) and DARPA (J. J. D.), Simons Foundation (542965, J. J. D.), Office of Naval Research (MURI N00014-21-1-2801; N00014-20-1-2589, J. J. D., D. L. K. Y.), and National Science Foundation (2124136, J. J. D.).
Competing interest
M. B. is a co-founder of Maddox AI. All other authors have no competing interest.