Computationally, perception, cognition, and action are inseparably intertwined in sequential, goal-directed behavior (Kessler, Frankenstein, & Rothkopf, 2022). However, the branch of research considered in Bowers et al. focuses on a single visual task, that of assigning single, discrete labels of object identity to images. This is as if the whole goal of human vision were to learn to shout out an appropriate word while being presented with a random pile of photographs. But, in the words of Thomas H. Huxley, the nineteenth-century English biologist and anthropologist: “The great end of life is not knowledge but action.” Perception is not l'art pour l'art. Instead, it occurs continuously in space and time as we perform structured tasks in a complex and dynamic environment (Fiehler & Karimpur, 2023). Perception guides action and action, in turn, impacts perception (Bremmer, Churan, & Lappe, 2017; Bremmer & Krekelberg, 2003; Eckmann, Klimmasch, Shi, & Triesch, 2020; Fiehler, Brenner, & Spering, 2019). Without action, we could not make changes in the world or interact with others. Here we argue that many of the limitations of current deep neural networks (DNNs) pointed out by Bowers et al. are likely rooted in a flawed and limited framing of perception and implausible supervised learning objectives, that recent DNNs represent fruitful avenues for overcoming some of these limitations, but that we must extend current models to account for the different functions of vision: perception, cognition, and action, and how they interact. Acknowledging that perception and action are intimately related has fundamental consequences. Here we highlight five key consequences.
The sensory input to biological visual systems is highly structured as it unfolds during goal-directed behavior. Accordingly, DNNs should be trained not on independent images presented in random order with corresponding labels, but in self-supervised ways by observing continuous, structured datasets, that is, events unfolding in space and time. Many real-world objects, such as animals or faces, are not just static entities, but move dynamically and nonrigidly (Dobs, Bülthoff, & Schultz, 2018). One potential avenue currently being explored is using forms of time-based self-supervised deep learning (Orhan, Gupta, & Lake, 2020; Schneider, Xu, Ernst, Yu, & Triesch, 2021; Zhuang et al., 2021), which form invariant object representations by mapping sequences of views onto close-by latent representations without the need for labels. These models also have the potential to capture dynamic aspects of object recognition, such as the perception of dynamic faces, which cannot be captured well by current models trained on static images (Jiahui et al., 2022).
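To make the idea of mapping temporally adjacent views onto nearby latent representations concrete, the following is a minimal sketch of one common family of objectives, an InfoNCE-style contrastive loss in which the next view of the same object is the positive example and views of other objects act as negatives. It is an illustration under stated assumptions, not the objective used in any of the cited studies: the function name `temporal_infonce`, the `temperature` value, and the toy "prototype plus noise" embeddings are all hypothetical.

```python
import numpy as np


def temporal_infonce(z_now, z_next, temperature=0.1):
    """InfoNCE-style loss over a batch of temporally adjacent view pairs.

    z_now, z_next: arrays of shape (B, D). Row i of each array holds the
    embeddings of two consecutive views of the same object; the remaining
    rows of z_next serve as negatives for row i of z_now.
    (Hypothetical helper, for illustration only.)
    """
    # Project embeddings onto the unit sphere so the dot product is a cosine.
    z_now = z_now / np.linalg.norm(z_now, axis=1, keepdims=True)
    z_next = z_next / np.linalg.norm(z_next, axis=1, keepdims=True)

    # Pairwise similarities between every current view and every next view.
    logits = z_now @ z_next.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_softmax)))


# Toy data (assumed): four "objects", each seen in two slightly different views.
rng = np.random.default_rng(0)
prototypes = np.eye(4)
views_t = prototypes + 0.05 * rng.standard_normal((4, 4))
views_t1 = prototypes + 0.05 * rng.standard_normal((4, 4))

aligned = temporal_infonce(views_t, views_t1)
shuffled = temporal_infonce(views_t, np.roll(views_t1, 1, axis=0))
```

In this toy setting the loss is low when consecutive views of the same object are paired correctly and high when the pairing is scrambled, which is the pressure that pushes a trained encoder toward view-invariant object representations without any labels.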
The structure of sensory input is in large part dependent on the observer's own actions. Thus, object perception and vision in general can only be understood in the context of an active, exploratory, multi-sensory observer, a view also reflected in current experimental work (Ayzenberg & Behrmann, 2023). Supervised approaches miss the impact of goal-directed action and interaction on structuring visual representations (Krakauer, Ghazanfar, Gomez-Marin, MacIver, & Poeppel, 2017). Accordingly, models should learn in self-supervised ways while interacting with their environment. Indeed, visual representations have been shown to be dependent on the active visual policy (Rothkopf, Weisswange, & Triesch, 2009). Going beyond pure self-supervised invariance learning, a recent approach considers the benefits of active control of the viewpoint for learning object representations (Xu & Triesch, 2023). Mimicking visual input from self-generated object manipulations, it learns a hierarchical representation that satisfies two complementary desiderata: being partly invariant to viewpoint changes while at the same time permitting prediction of which action is responsible for a particular change in the representation.
Learning and adaptation must be a continuous process, not limited to discrete training and test phases, but occurring continually during extended interactions with the environment. Recent approaches involving DNNs have addressed the challenge of continual learning (Wang, Liu, Duan, Kong, & Tao, 2022). However, the breadth of the required continuous adaptation to changing conditions (Roelfsema & Holtmaat, 2018; Schmitt et al., 2021) and the delicate balance of the classic stability–plasticity dilemma are still open problems for current DNNs.
The learning objectives must permit rich and adaptive representations that can feed multiple forms of interacting with the world. Instead, many of the studies considered by Bowers et al. relate to the single task of object recognition simply because the vast majority of current DNN approaches to vision select a task that gets away with ignoring actions: attaching labels to images. Few current neural network models conceptualize visual tasks in terms of visual routines, with some exceptions applying the framework of reinforcement learning to sequential visual behaviors (Araslanov, Rothkopf, & Roth, 2019). Promising directions are to jointly investigate a broad range of visual tasks (Dwivedi, Bonner, Cichy, & Roig, 2021), to investigate those computational visual tasks relevant for action, which are predominantly attributed to the dorsal stream, and to consider ecologically relevant cost functions that can account for dorsal stream properties in the primate brain (Mineault, Bakhtiari, Richards, & Pack, 2021).
Models will need to properly compute the interactions of sensory uncertainties, uncertain internal beliefs, and action variability to successfully achieve the organism's goals in sequential, adaptive behavior. Bowers et al. do not mention uncertainty once in their article. Current DNN models are not well suited to the computations required for proper belief propagation in sequential perception and action under uncertainty, as required in extended behavior, where perception and action are inseparably intertwined. As an example, humans actively use their perception and their actions to shape their internal beliefs about landmarks in navigation (Kessler et al., 2022). In their critique, Bowers et al. ignore this major computational challenge, which requires making accurate causal inferences about the origins of uncertainty in sensory data and adaptive motor output (Straub & Rothkopf, 2022).
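As a deliberately simplified illustration of what belief propagation across a perception-action cycle involves, the sketch below tracks a one-dimensional Gaussian belief (for example, about one's position relative to a landmark) with a scalar Kalman filter. Acting adds motor variability to the belief, and observing reduces uncertainty in proportion to sensory reliability. The linear-Gaussian setting, the function names `predict` and `update`, and all numerical values are assumptions chosen for illustration; this is not the model used in the cited studies.

```python
def predict(mean, var, action, motor_var):
    """Propagate the belief through a self-generated movement.

    The intended displacement shifts the mean; motor variability
    (motor_var) inflates the uncertainty of the belief.
    """
    return mean + action, var + motor_var


def update(mean, var, observation, sensory_var):
    """Combine the predicted belief with a noisy sensory observation.

    The gain weights the observation by its reliability relative to
    the current belief, as in a scalar Kalman filter.
    """
    gain = var / (var + sensory_var)
    new_mean = mean + gain * (observation - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var


# One perception-action cycle with assumed, illustrative numbers.
mean, var = 0.0, 1.0                                   # prior belief
mean, var = predict(mean, var, action=2.0, motor_var=0.5)
mean, var = update(mean, var, observation=3.0, sensory_var=1.5)
```

Even in this toy case, the final belief depends jointly on prior uncertainty, motor variability, and sensory noise, so none of the three can be modeled in isolation. Realistic behavior adds nonlinear dynamics, high-dimensional sensory input, and action selection that itself depends on the belief.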
In conclusion, we agree with Bowers et al.'s critique, but if we want to fully understand human vision including object recognition, our models must embrace the fact that vision is intimately intertwined with action in behaving, goal-directed agents.
Financial support
The research reported herein was supported by “The Adaptive Mind,” funded by the Excellence Program of the Hessian Ministry of Higher Education, Science, Research and Art.
Competing interest
None.