Models of vision need some action

Constantin Rothkopf; Frank Bremmer; Katja Fiehler; Katharina Dobs; Jochen Triesch

doi:10.1017/S0140525X23001577

Published online by Cambridge University Press: 06 December 2023

Constantin Rothkopf: Affiliations:
Centre for Cognitive Science, Technical University of Darmstadt, Darmstadt, Germany; Frankfurt Institute for Advanced Studies, Goethe-Universität Frankfurt, Frankfurt am Main, Germany; Center for Mind, Brain and Behavior, University of Marburg and Justus Liebig University Giessen, Giessen, Germany; HMWK-Clusterproject The Adaptive Mind, Hesse, Germany (https://www.theadaptivemind.de/). Email: constantin.rothkopf@cogsci.tu-darmstadt.de
Frank Bremmer: Affiliations:
Center for Mind, Brain and Behavior, University of Marburg and Justus Liebig University Giessen, Giessen, Germany; HMWK-Clusterproject The Adaptive Mind, Hesse, Germany (https://www.theadaptivemind.de/); Applied Physics and Neurophysics, University of Marburg, Marburg, Germany. Email: frank.bremmer@physik.uni-marburg.de
Katja Fiehler: Affiliations:
Center for Mind, Brain and Behavior, University of Marburg and Justus Liebig University Giessen, Giessen, Germany; HMWK-Clusterproject The Adaptive Mind, Hesse, Germany (https://www.theadaptivemind.de/); Experimental Psychology, Justus Liebig University Giessen, Giessen, Germany. Email: Katja.Fiehler@psychol.uni-giessen.de
Katharina Dobs: Affiliations:
Center for Mind, Brain and Behavior, University of Marburg and Justus Liebig University Giessen, Giessen, Germany; HMWK-Clusterproject The Adaptive Mind, Hesse, Germany (https://www.theadaptivemind.de/); Experimental Psychology, Justus Liebig University Giessen, Giessen, Germany. Email: katharina.dobs@psychol.uni-giessen.de
Jochen Triesch: Affiliations:
Frankfurt Institute for Advanced Studies, Goethe-Universität Frankfurt, Frankfurt am Main, Germany; Center for Mind, Brain and Behavior, University of Marburg and Justus Liebig University Giessen, Giessen, Germany; HMWK-Clusterproject The Adaptive Mind, Hesse, Germany (https://www.theadaptivemind.de/). Email: triesch@fias.uni-frankfurt.de

Abstract

Bowers et al. focus their criticisms on research that compares behavioral and brain data from the ventral stream with a class of deep neural networks for object recognition. While they are right to identify issues with current benchmarking research programs, they overlook a much more fundamental limitation of this literature: Disregarding the importance of action and interaction for perception.

Type: Open Peer Commentary
Information: Behavioral and Brain Sciences, Volume 46, 2023, e405

Copyright: © The Author(s), 2023. Published by Cambridge University Press

Computationally, perception, cognition, and action are inseparably intertwined in sequential, goal-directed behavior (Kessler, Frankenstein, & Rothkopf, 2022). However, the branch of research considered in Bowers et al. focuses on a single visual task, that of assigning single, discrete labels of object identity to images. This is as if the whole goal of human vision were to learn to shout out an appropriate word while being presented with a random pile of photographs. But, in the words of Thomas H. Huxley, the nineteenth-century English biologist and anthropologist: “The great end of life is not knowledge but action.” Perception is not l'art-pour-l'art. Instead, it occurs continuously in space and time as we perform structured tasks in a complex and dynamic environment (Fiehler & Karimpur, 2023). Perception guides action and action, in turn, impacts perception (Bremmer, Churan, & Lappe, 2017; Bremmer & Krekelberg, 2003; Eckmann, Klimmasch, Shi, & Triesch, 2020; Fiehler, Brenner, & Spering, 2019). Without action, we could not make changes in the world or interact with others. Here we argue that many of the limitations of current deep neural networks (DNNs) pointed out by Bowers et al. are likely rooted in a flawed and limited framing of perception and implausible supervised learning objectives, that recent DNNs represent fruitful avenues for overcoming some of these limitations, but that we must extend current models to account for the different functions of vision: perception, cognition, and action, and how they interact. Acknowledging that perception and action are intimately related has fundamental consequences. Here we highlight five key consequences.

The sensory input to biological visual systems is highly structured as it unfolds during goal-directed behavior. Accordingly, DNNs should be trained not on independent images presented in random order with corresponding labels, but in self-supervised ways by observing continuous, structured datasets, that is, events unfolding in space and time. Many real-world objects, such as animals or faces, are not just static entities, but move dynamically and nonrigidly (Dobs, Bülthoff, & Schultz, 2018). One potential avenue currently being explored is using forms of time-based self-supervised deep learning (Orhan, Gupta, & Lake, 2020; Schneider, Xu, Ernst, Yu, & Triesch, 2021; Zhuang et al., 2021), which form invariant object representations by mapping sequences of views onto close-by latent representations without the need for labels. These models also have the potential to capture dynamic aspects of object recognition, such as the perception of dynamic faces, which cannot be captured well by current models trained on static images (Jiahui et al., 2022).
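To make the idea of mapping successive views onto close-by latent representations concrete, the following is a minimal sketch of a time-contrastive objective. It is not the implementation used in any of the cited studies: the InfoNCE-style form of the loss, the function names, the temperature value, and the two-dimensional toy "embeddings" are all assumptions chosen for illustration, and no encoder is trained here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def time_contrastive_loss(embeddings, temperature=0.1):
    """InfoNCE-style loss over a sequence of frame embeddings (assumed form).

    For each frame t except the last, the next frame t + 1 is the positive
    and every other frame in the sequence is a negative. No labels are used.
    """
    n = len(embeddings)
    total = 0.0
    for t in range(n - 1):
        # Similarities of frame t to every other frame, skipping itself.
        logits = [
            cosine(embeddings[t], embeddings[j]) / temperature
            for j in range(n)
            if j != t
        ]
        # Because index t was skipped, frame t + 1 sits at position t.
        positive = logits[t]
        log_partition = math.log(sum(math.exp(x) for x in logits))
        total += log_partition - positive
    return total / (n - 1)


# Toy embeddings (hypothetical): a slowly rotating object whose latent
# code drifts smoothly from one frame to the next.
views = [(math.cos(0.3 * k), math.sin(0.3 * k)) for k in range(8)]
# The same embeddings presented in a scrambled temporal order.
shuffled = [views[i] for i in (0, 4, 1, 5, 2, 6, 3, 7)]

smooth_loss = time_contrastive_loss(views)
shuffled_loss = time_contrastive_loss(shuffled)
print(smooth_loss < shuffled_loss)
```

In this toy setting the loss is lower when temporally adjacent frames have similar embeddings than when the same embeddings appear in scrambled order, which is the pressure that, in a trained network, would push successive views of an object toward nearby latent codes.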

The structure of sensory input is in large part dependent on the observer's own actions. Thus, object perception and vision in general can only be understood in the context of an active, exploratory, multi-sensory observer, a view also reflected in current experimental work (Ayzenberg & Behrmann, 2023). Supervised approaches miss the impact of goal-directed action and interaction on structuring visual representations (Krakauer, Ghazanfar, Gomez-Marin, MacIver, & Poeppel, 2017). Accordingly, models should learn in self-supervised ways while interacting with their environment. Indeed, visual representations have been shown to be dependent on the active visual policy (Rothkopf, Weisswange, & Triesch, 2009). Going beyond pure self-supervised invariance learning, a recent approach considers the benefits of active control of the viewpoint for learning object representations (Xu & Triesch, 2023). Mimicking visual input from self-generated object manipulations, it learns a hierarchical representation to satisfy the two complementary desiderata of being partly invariant to viewpoint changes while at the same time permitting prediction of which action is responsible for a particular change in the representation.
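The two desiderata named above, partial invariance to viewpoint and the ability to predict the responsible action, can be illustrated with a hand-built toy representation. This sketch is purely illustrative and is not the cited approach: nothing is learned, the encoder is written by hand, and the function names and the restriction of "actions" to one-dimensional rotations are assumptions.

```python
import math


def encode(object_id, view_angle):
    """Hypothetical hand-built encoder returning (invariant, equivariant) parts.

    The invariant part depends only on object identity; the equivariant
    part changes in a predictable way when the viewpoint is rotated.
    """
    invariant = (float(object_id),)
    equivariant = (math.cos(view_angle), math.sin(view_angle))
    return invariant, equivariant


def invariance_error(rep_a, rep_b):
    """Squared distance between the invariant parts of two representations."""
    return sum((a - b) ** 2 for a, b in zip(rep_a[0], rep_b[0]))


def decode_action(rep_before, rep_after):
    """Recover the rotation (in radians) that maps one view onto the other."""
    (c0, s0), (c1, s1) = rep_before[1], rep_after[1]
    # Angle-difference identities: sin(b - a) and cos(b - a).
    return math.atan2(s1 * c0 - c1 * s0, c1 * c0 + s1 * s0)


# A self-generated manipulation: rotate object 3 by 0.5 rad.
before = encode(3, 0.2)
after = encode(3, 0.2 + 0.5)

print(invariance_error(before, after))
print(decode_action(before, after))
```

The invariant part is unchanged by the rotation, while the rotation itself can be read out from the change in the equivariant part. A learned model would have to discover such a split from data rather than have it built in.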

Learning and adaptation must be a continuous process, not limited to discrete training and test phases, but occurring continually during extended interactions with the environment. Recent approaches involving DNNs have addressed the challenge of continual learning (Wang, Liu, Duan, Kong, & Tao, 2022). However, the breadth of the required continuous adaptation to changing conditions (Roelfsema & Holtmaat, 2018; Schmitt et al., 2021) and the delicate balance of the classic stability–plasticity dilemma are still open problems for current DNNs.

The learning objectives must permit rich and adaptive representations that can feed multiple forms of interacting with the world. Instead, many of the studies considered by Bowers et al. relate to the single task of object recognition simply because the vast majority of current DNN approaches to vision select a task that gets away with ignoring actions: attaching labels to images. Few current neural network models conceptualize visual tasks in terms of visual routines, with some exceptions applying the framework of reinforcement learning to sequential visual behaviors (Araslanov, Rothkopf, & Roth, 2019). Promising directions are to jointly investigate a broad range of visual tasks (Dwivedi, Bonner, Cichy, & Roig, 2021), to investigate those computational visual tasks relevant for action, which are predominantly attributed to the dorsal stream, and to consider ecologically relevant cost functions that can account for dorsal stream properties in the primate brain (Mineault, Bakhtiari, Richards, & Pack, 2021).

Models will need to properly compute the interactions of sensory uncertainties, internally modeled uncertain beliefs, and action variabilities to successfully achieve the organism's goals in sequential, adaptive behavior. Bowers et al. do not mention uncertainty once in their article. Current DNN models are not well suited to the computations required for proper belief propagation in sequential perception and action under uncertainty as required in extended behavior, where they are inseparably intertwined. As an example, humans use their perception and their actions actively to shape their internal beliefs about landmarks in navigation (Kessler et al., 2022). In their critique, Bowers et al. ignore the major computational challenge, which requires making accurate causal inferences about the origins of uncertainty in sensory data and adaptive motor output (Straub & Rothkopf, 2022).
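As a textbook illustration of how sensory uncertainty, an internal belief, and action variability interact over time, the following sketch implements one predict-update cycle of a one-dimensional Kalman filter. It is a deliberately simple stand-in, not the model of any cited study: the linear-Gaussian assumptions, the function names, and the numerical values are all chosen here for illustration.

```python
def predict(mean, var, action, motor_var):
    """Propagate a Gaussian belief about position through a noisy action.

    The intended displacement shifts the mean; motor variability
    increases the uncertainty of the belief.
    """
    return mean + action, var + motor_var


def update(mean, var, observation, sensory_var):
    """Combine the current belief with a noisy sensory observation.

    The gain weights the observation by the relative reliability of
    belief and sensation, and the posterior variance shrinks.
    """
    gain = var / (var + sensory_var)
    new_mean = mean + gain * (observation - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var


# Hypothetical numbers: the agent believes it is at 0.0 (variance 1.0),
# intends to step 2.0 units, and then observes its position as 3.0.
prior_mean, prior_var = predict(0.0, 1.0, action=2.0, motor_var=0.5)
post_mean, post_var = update(prior_mean, prior_var, observation=3.0, sensory_var=0.5)

print(prior_mean, prior_var)
print(post_mean, post_var)
```

Acting makes the belief less certain, while sensing makes it more certain than either the prior belief or the observation alone. Behavior extended in time requires chaining many such steps, typically with nonlinear dynamics and richer belief representations than a single Gaussian.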

In conclusion, we agree with Bowers et al.'s critique, but if we want to fully understand human vision including object recognition, our models must embrace the fact that vision is intimately intertwined with action in behaving, goal-directed agents.

Financial support

The research reported herein was supported by “The Adaptive Mind,” funded by the Excellence Program of the Hessian Ministry of Higher Education, Science, Research and Art.

Competing interest

None.

References

Araslanov, N., Rothkopf, C. A., & Roth, S. (2019). Actor-critic instance segmentation. In L. Davis, P. Torr, & S.-Z. Zhu (Eds.), Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, California, 16–20 June 2019 (pp. 8237–8246).

Ayzenberg, V., & Behrmann, M. (2023). The where, what, and how of object recognition. Trends in Cognitive Sciences, 27, 335–336.

Bremmer, F., Churan, J., & Lappe, M. (2017). Heading representations in primates are compressed by saccades. Nature Communications, 8, 920.

Bremmer, F., & Krekelberg, B. (2003). Seeing and acting at the same time: Challenges for brain (and) research. Neuron, 38, 367–370.

Dobs, K., Bülthoff, I., & Schultz, J. (2018). Use and usefulness of dynamic face stimuli for face perception studies – A review of behavioral findings and methodology. Frontiers in Psychology, 9, 1355.

Dwivedi, K., Bonner, M. F., Cichy, R. M., & Roig, G. (2021). Unveiling functions of the visual cortex using task-specific deep neural networks. PLoS Computational Biology, 17(8), e1009267.

Eckmann, S., Klimmasch, L., Shi, B. E., & Triesch, J. (2020). Active efficient coding explains the development of binocular vision and its failure in amblyopia. Proceedings of the National Academy of Sciences of the United States of America, 117(11), 6156–6162.

Fiehler, K., Brenner, E., & Spering, M. (2019). Prediction in goal-directed action. Journal of Vision, 19(9), 10, 1–21.

Fiehler, K., & Karimpur, H. (2023). Spatial coding for action across spatial scales. Nature Reviews Psychology, 2, 72–84.

Jiahui, G., Feilong, M., di Oleggio Castello, M. V., Nastase, S. A., Haxby, J. V., & Gobbini, M. I. (2022). Modeling naturalistic face processing in humans with deep convolutional neural networks. bioRxiv, 1–39.

Kessler, F., Frankenstein, J., & Rothkopf, C. A. (2022). A dynamic Bayesian actor model explains endpoint variability in homing tasks. bioRxiv, 1–25.

Krakauer, J. W., Ghazanfar, A. A., Gomez-Marin, A., MacIver, M. A., & Poeppel, D. (2017). Neuroscience needs behavior: Correcting a reductionist bias. Neuron, 93, 480–490.

Mineault, P. J., Bakhtiari, S., Richards, B. A., & Pack, C. C. (2021). Your head is there to move you around: Goal-driven models of the primate dorsal pathway. Advances in Neural Information Processing Systems, 34, 28757–28771.

Orhan, E., Gupta, V., & Lake, B. M. (2020). Self-supervised learning through the eyes of a child. Advances in Neural Information Processing Systems, 33, 9960–9971.

Roelfsema, P. R., & Holtmaat, A. (2018). Control of synaptic plasticity in deep cortical networks. Nature Reviews Neuroscience, 19, 166–180.

Rothkopf, C. A., Weisswange, T. H., & Triesch, J. (2009). Learning independent causes in natural images explains the space variant oblique effect. In M. Amine, N. Enayati, & H. Li (Eds.), 2009 IEEE 8th international conference on development and learning, Shanghai, China, 5–7 June 2009 (pp. 1–6). IEEE.

Schmitt, C., Schwenk, J. C. B., Schütz, A., Churan, J., Kaminiarz, A., & Bremmer, F. (2021). Preattentive processing of visually guided self-motion in humans and monkeys. Progress in Neurobiology, 205, 102117.

Schneider, F., Xu, X., Ernst, M. R., Yu, Z., & Triesch, J. (2021). Contrastive learning through time. In SVRHM 2021 Workshop@NeurIPS.

Straub, D., & Rothkopf, C. A. (2022). Putting perception into action with inverse optimal control for continuous psychophysics. eLife, 11, 76635.

Wang, Z., Liu, L., Duan, Y., Kong, Y., & Tao, D. (2022). Continual learning with lifelong vision transformer. In R. Chellappa, J. Matas, L. Quan, & M. Shah (Eds.), Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, New Orleans, Louisiana, 19–24 June 2022 (pp. 171–181).

Xu, X., & Triesch, J. (2023). CIPER: Combining invariant and equivariant representations using contrastive and predictive learning. http://arxiv.org/abs/2302.02330

Zhuang, C., Yan, S., Nayebi, A., Schrimpf, M., Frank, M. C., DiCarlo, J. J., & Yamins, D. L. (2021). Unsupervised neural network models of the ventral visual stream. Proceedings of the National Academy of Sciences of the United States of America, 118(3), e2014196118.