
Neither hype nor gloom do DNNs justice

Published online by Cambridge University Press:  06 December 2023

Felix A. Wichmann
Affiliation:
Neural Information Processing Group, University of Tübingen, Tübingen, Germany felix.wichmann@tuebingen.de
Simon Kornblith
Affiliation:
Google Research, Brain Team, Toronto, ON, Canada skornblith@google.com
Robert Geirhos
Affiliation:
Google Research, Brain Team, Toronto, ON, Canada geirhos@google.com

Abstract

Neither the hype exemplified in some exaggerated claims about deep neural networks (DNNs), nor the gloom expressed by Bowers et al. do DNNs as models in vision science justice: DNNs rapidly evolve, and today's limitations are often tomorrow's successes. In addition, providing explanations as well as prediction and image-computability are model desiderata; one should not be favoured at the expense of the other.

Type
Open Peer Commentary
Copyright
Copyright © The Author(s), 2023. Published by Cambridge University Press

We agree with Bowers et al. that some of the statements about deep neural networks (DNNs) as “best models” quoted at the beginning of their target article are exaggerated – perhaps some of them bordering on scientific hype (Intemann, 2020). However, only the authors of such exaggerated statements are to blame, not DNNs: Instead of blaming DNNs, perhaps Bowers et al. should have engaged in a critical discussion of the increasingly widespread practice of rewarding impact and boldness over carefulness and modesty that allows hyperbole to flourish in science. This is unfortunate, as the target article does raise a number of valid concerns about DNNs in vision science. For example, we fully agree that human vision is much more than recognising photographs of objects in scenes; we also fully agree there are still a number of important behavioural differences between DNNs and humans even in terms of core object recognition (DiCarlo, Zoccolan, & Rust, 2012), that is, even when recognising photographs of objects in scenes, such as DNNs’ adversarial susceptibility (target article, sect. 4.1.1) or reliance on local rather than global features (target article, sect. 4.1.3). However, we do not subscribe to the somewhat gloomy view of DNNs in vision science expressed by Bowers et al. We believe that image-computable models are essential to the future of vision science, and DNNs are currently the most promising – albeit not yet fully adequate – model class for core object recognition.

Importantly, any behavioural differences between DNNs and humans can only be a snapshot in time – true as of today. Unlike Bowers et al., we do not see any evidence that future, novel DNN architectures, training data and regimes may not be able to overcome at least some of the limitations mentioned in the target article – and Bowers et al. certainly do not provide any convincing evidence why solving such tasks is beyond DNNs in principle, that is, forever. In just over a decade, DNNs have come a long way from AlexNet, and we still witness tremendous progress in deep learning. Until recently, DNNs lacked robustness to image distortions; now some match or outperform humans on many of them. DNNs made very different error patterns than humans; newer models achieve at least somewhat better consistency (Geirhos et al., 2021). DNNs used to be texture-biased; now some are shape-biased similar to humans (Dehghani et al., 2023). With DNNs, today's limitations are often tomorrow's success stories.

Yes, current DNNs still fail on a large number of “psychological tasks,” from (un-)crowding (Doerig, Bornet, Choung, & Herzog, 2020) to focusing on local rather than global shape (Baker, Lu, Erlikhman, & Kellman, 2018), from similarity judgements (German & Jacobs, 2020) to combinatorial judgements (Montero, Bowers, Costa, Ludwig, & Malhotra, 2022); furthermore, current DNNs lack (proper, human-like) sensitivity to Gestalt principles (Biscione & Bowers, 2023). But current DNNs in vision are typically trained to recognise static images; their failure on “psychological tasks” without (perhaps radically) different training or different optimisation objectives does not surprise us – just as we do not expect a traditional vision model of motion processing to predict lightness induction or an early spatial vision model to predict Gestalt laws, at least not without substantial modification and fitting it to suitable data. To overcome current DNNs’ limitations on psychological tasks we need more DNN research inspired by vision science, not just engineering to improve models’ overall accuracy – here we certainly agree again with Bowers et al.

Moreover, for many of the abovementioned psychological tasks, there simply do not exist successful traditional vision models. Why single out DNNs as failures if no successful computational model exists, at least not image-computable models? Traditional “object recognition” models only model isolated aspects of object recognition, and it is difficult to tell how well they model these aspects, since only image-computable models can actually recognise objects. Here, image-computability is far more than just a “nice to have” criterion since it facilitates falsifiability. We think that Bowers et al.'s long list of DNN failures should rather be taken as a list of desiderata of what future image-computable models of human vision should explain and predict.

We do not know whether DNNs will be sufficient to meet this challenge; only future research will resolve the many open questions: Is our current approach of applying predominantly discriminative DNNs as computational models of human vision sufficient to obtain truly successful models? Do we need to incorporate, for example, causality (Pearl, 2009), or generative models such as predictive coding (Rao & Ballard, 1999) or even symbolic computations (Mao, Gan, Kohli, Tenenbaum, & Wu, 2019)? Do we need to ground learning in intuitive theories of physics and psychology (Lake, Ullman, Tenenbaum, & Gershman, 2017)?

Finally, it appears as if Bowers et al. argue that models should first and foremost provide explanations, as if predictivity – which includes but is not limited to image-computability – did not matter much. (Or observational data; successful models need to be able to explain and predict data from hypothesis-driven experiments as well as observational data.) While we agree with Bowers et al. that in machine learning there is a tendency to blindly chase improved benchmark numbers without seeking understanding of underlying phenomena, we believe that both prediction and explanation are required: An explanation without prediction cannot be trusted, and a prediction without explanation does not aid understanding. What we need is not a myopic focus on one or the other, but to be more explicit about modelling goals – both in the target article by Bowers et al. and in general, as we argue in a forthcoming article (Wichmann & Geirhos, 2023).

We think that neither the hype exemplified in some exaggerated claims about DNNs, nor the gloom expressed by Bowers et al. do DNNs and their application to vision science justice. Looking forward, if we want to make progress towards modelling and understanding human visual perception, we believe that it will be key to move beyond both hype and gloom and carefully explore similarities and differences between human vision and rapidly evolving DNNs.

Financial support

This work was supported by the German Research Foundation (DFG): SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP 4, project number: 276693517 to F. A. W. In addition, F. A. W. is a member of the Machine Learning Cluster of Excellence, funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC number 2064/1 – Project number 390727645.

Competing interest

None.

References

Baker, N., Lu, H., Erlikhman, G., & Kellman, P. J. (2018). Deep convolutional networks do not classify based on global object shape. PLoS Computational Biology, 14(12), e1006613.
Biscione, V., & Bowers, J. S. (2023). Mixed evidence for Gestalt grouping in deep neural networks. Computational Brain & Behavior, 1–19. https://doi.org/10.1007/s42113-023-00169-2
Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., … Houlsby, N. (2023). Scaling vision transformers to 22 billion parameters. arXiv, arXiv:2302.05442.
DiCarlo, J. J., Zoccolan, D., & Rust, N. C. (2012). How does the brain solve visual object recognition? Neuron, 73(3), 415–434.
Doerig, A., Bornet, A., Choung, O. H., & Herzog, M. H. (2020). Crowding reveals fundamental differences in local vs. global processing in humans and machines. Vision Research, 167, 39–45.
Geirhos, R., Narayanappa, K., Mitzkus, B., Thieringer, T., Bethge, M., Wichmann, F. A., & Brendel, W. (2021). Partial success in closing the gap between human and machine vision. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P. S., & Wortman Vaughan, J. (Eds.), Advances in neural information processing systems (Vol. 34, pp. 23885–23899). Curran Associates.
German, J. S., & Jacobs, R. A. (2020). Can machine learning account for human visual object shape similarity judgments? Vision Research, 167, 87–99.
Intemann, K. (2020). Understanding the problem of “hype”: Exaggeration, values, and trust in science. Canadian Journal of Philosophy, 52(3), 1–16.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40, e253.
Mao, J., Gan, C., Kohli, P., Tenenbaum, J. B., & Wu, J. (2019). The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations (ICLR), New Orleans, Louisiana, United States, pp. 1–28. https://iclr.cc/Conferences/2019
Montero, M. L., Bowers, J. S., Costa, R. P., Ludwig, C. J. H., & Malhotra, G. (2022). Lost in latent space: Disentangled models and the challenge of combinatorial generalisation. arXiv, arXiv:2204.02283.
Pearl, J. (2009). Causality (2nd ed.). Cambridge University Press.
Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79–87.
Wichmann, F. A., & Geirhos, R. (2023). Are deep neural networks adequate behavioural models of human visual perception? Annual Review of Vision Science, 9. https://doi.org/10.1146/annurev-vision-120522-031739