
There is a fundamental, unbridgeable gap between DNNs and the visual cortex

Published online by Cambridge University Press:  06 December 2023

Moshe Gur*
Affiliation:
Department of Biomedical Engineering, Technion, Haifa, Israel mogi@bm.technion.ac.il

Abstract

Deep neural networks (DNNs) are not just inadequate models of the visual system; they are so different in structure and functionality that they are not even on the same playing field. DNN units have almost nothing in common with neurons, and, unlike visual neurons, they are often fully connected. At best, DNNs can label inputs, whereas our object perception is both holistic and detail preserving, a feat that no computational system can achieve.

Type: Open Peer Commentary
Copyright © The Author(s), 2023. Published by Cambridge University Press

The authors make a valuable contribution in pointing out that deep neural networks (DNNs) are not good models of the visual system, since DNNs rely on predictions and fail to account for results from many psychophysical experiments. However, by implicitly accepting that the basic structure and operational principles of DNNs are a fair simulation of the visual system, the authors overlook that DNNs are not merely inadequate representations of biological vision; they are not even on the same playing field.

The discovery of feature-selective cells in primate V1 and the subsequent finding of face-selective cells in the monkey inferotemporal (IT) cortex have led to the dominant physiology-based view that object representation and recognition are encoded by sparse populations of expert cell ensembles (Freiwald & Tsao, 2010). This theory posits that the image is first analyzed into its basic elements, such as line segments, by V1 feature-selective cells. Then, after hierarchical convergence and integration of these simple elements, expert cell ensembles represent objects uniquely through their collective responses.

These physiological findings and models have inspired various computational models (Marr, 2010), leading to current DNNs. Such models, though incorporating some neuron-like characteristics, were not intended to emulate brain mechanisms; much of the perceived similarity between biological and artificial networks owes more to shared terminology (e.g., cells, layers, learning, features) than to exact replication. The fundamental differences between biological and artificial object-recognition mechanisms are manifested in functional anatomy and, most importantly, in the inherent inability of DNNs, or any other computational mechanism, including expert ensembles, to replicate our perceptual abilities.

Functional anatomy

Profound differences between biological networks and DNNs can be found at all organizational levels. Single neurons, with their complex electro-chemical interactions, are unlike the schematic DNN “neurons.” As noted by Ullman (2019), almost everything known about biological neurons was left out of their DNN counterparts. Connectivity between single elements presents another striking divergence between DNNs and the visual cortex. DNNs typically contain many layers whose units are connected to all units in adjacent layers, whereas no such connectivity exists in the visual cortex. Lateral connections within a cortical area are sparse; in V1, intrinsic connections run only between columns of the same orientation preference (Stettler, Das, Bennett, & Gilbert, 2002). Feed-forward connections in the ventral pathway, V1 → V2 → V4 → IT, are not one-to-one but many-to-one, leading to an increase in receptive field size from minutes of arc in V1 to many degrees in the IT cortex (Rolls, 2012). Most feedback from higher to lower areas is less dense than the feed-forward projection and does not target single cells (Rolls, 2012). Also, DNNs are usually composed of more than 20 layers, including highly specialized ones, whereas the visual hierarchy comprises only four areas, which are remarkably similar in their basic anatomy. While such profound structural differences are in themselves sufficient to deny that DNNs “are the best models of biological vision,” an even larger gap between DNNs and biological reality stems from ultimate differences in functionality. Note that since expert-ensemble and DNN models share essential characteristics, such as hierarchical feature extraction and integration, and since the latter differ structurally from the visual cortex, showing that the biologically based model is not a good model of the visual cortex rules out DNNs as well.
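The connectivity contrast above can be made concrete with a back-of-the-envelope count. The sketch below is illustrative only; the layer sizes and the fan-in value are assumed numbers chosen to show the scale of the difference between full connectivity and sparse, local connectivity, not measurements from the cortex or from any particular DNN.

```python
# Illustrative sketch (assumed numbers): connection counts for a fully
# connected artificial layer versus a sparsely, locally connected one.

def fully_connected_params(n_in, n_out):
    """Every unit receives input from every unit in the previous layer."""
    return n_in * n_out

def sparse_params(n_out, fan_in):
    """Each unit receives input only from a small local neighborhood."""
    return n_out * fan_in

n_in, n_out = 10_000, 10_000          # assumed layer sizes
dense = fully_connected_params(n_in, n_out)   # 100,000,000 connections
sparse = sparse_params(n_out, fan_in=50)      # 500,000 connections
print(dense // sparse)                # dense layer has 200x more connections
```

Even with these modest assumed sizes, the fully connected layer carries two orders of magnitude more connections than the sparse one, which is the structural point the paragraph makes.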

Invariance

We recognize an object even though it can vary greatly in appearance. Figure 1 displays a triangle that has many properties (e.g., size) and many variations within each property. Combining properties and variations yields an astronomical number of possible appearances, yet all such objects are recognized as triangles and are differentiated both from each other and from any of the similarly astronomical possible appearances of, say, a square. It is impossible for a small number of “expert” cells to generate a pattern that is unique to every single triangle yet differs from all square-generated patterns. The same arguments obviously apply to all other object categories. Are we to assume, then, that there are ensembles dedicated to each of the many thousands of object categories out there?
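The combinatorial claim can be illustrated with a toy calculation. The properties listed below and the number of discriminable values per property are hypothetical placeholders; the commentary itself only asserts that the total is astronomical, and even these conservative assumed values already yield a number far beyond what a small ensemble could enumerate.

```python
# Hypothetical count of a triangle's appearance variations. Property names
# and per-property counts are assumed for illustration only.
from math import prod

variations_per_property = {
    "size": 100,
    "orientation": 360,
    "position": 10_000,
    "line_width": 20,
    "contrast": 50,
    "vertex_angles": 1_000,
}

total = prod(variations_per_property.values())
print(f"{total:.1e}")  # ~3.6e14 distinct triangle appearances
```

Multiplying just six assumed properties produces hundreds of trillions of distinct appearances, each of which must be recognized as a triangle and distinguished from every appearance of a square.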

Figure 1. A triangle is recognized as such despite potentially appearing in any one of an astronomical number of variations.

Space and time characteristics of object perception

A collection (~0.6° in total width) of small dots displayed for ~50 ms is easily perceived as a turtle (Fig. 2; Gur, 2021). This result is quite informative: the display size means that the dots must be represented by V1 cells, which are the only ones with small enough receptive fields. The short exposure means that correctly judging the positions of, and relationships between, the dots, which is essential for holistic object perception, cannot result from V1 → V2 → V4 → IT hierarchical convergence, since there is simply not enough time (Schmolesky et al., 1998). This example is consistent with many studies showing accurate perception of flashed objects (cf. Crouzet, Kirchner, & Thorpe, 2010; Greene & Visani, 2015; Gur, 2018, 2021; Keysers, Xiao, Foldiak, & Perrett, 2001). Thus, unlike the predictions of the expert-ensemble theory, our perception is almost instantaneous, parallel, and detail preserving.
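The timing argument can be sketched as a simple comparison of exposure duration against per-area response onsets. The latency values below are assumed placeholders in the spirit of the timing data of Schmolesky et al. (1998), not quotations from that paper; they serve only to show why a ~50 ms exposure leaves no room for hierarchical convergence.

```python
# Sketch of the timing argument with assumed response-onset latencies (ms)
# along the ventral pathway. Values are illustrative placeholders.
onset_latency_ms = {"V1": 50, "V2": 65, "V4": 85, "IT": 110}
exposure_ms = 50

# Which areas have even begun to respond before the stimulus disappears?
reached_in_time = [area for area, t in onset_latency_ms.items()
                   if t <= exposure_ms]
print(reached_in_time)  # only V1 responds before the stimulus is gone
```

Under these assumptions only V1 is active within the exposure window, so the dot positions must already be represented, and related to one another, at that level.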

Figure 2. An animal is recognized even when its contour is represented by dots and displayed for ~50 ms.

Detailed and holistic

These two characteristics are contradictory in any integrative/computational system in which the whole is derived by integrating over its parts. Yet this is how we see the world. It is the visual system's ability to perceive space simultaneously and in parallel that yields this holistic yet detailed capacity. The world is perceived almost in a flash; the position and features (size, shape, orientation, etc.) of all elements are available for an immediate decision – a building or a face? Such discrimination between objects, with such retention of detail, would not be possible if elements were integrated and sent serially across many synapses to areas downstream from V1.

Finally, any computational system, including those discussed here, can, at best, map inputs to outputs. To recognize an object, such a system generates a label, “object #2319764,” which is devoid of all the object's constituent elements. This is quite different from our perceptual experience, in which an object is perceived together with its minute details. This discrepancy between perceptual reality and even the best possible performance of a computational system clearly means that such a system cannot be a model of the visual system.

Financial support

This research received no specific grant from any funding agency, commercial, or not-for-profit sectors.

Competing interest

None.

References

Crouzet, S. M., Kirchner, H., & Thorpe, S. J. (2010). Fast saccades toward faces: Face detection in just 100 ms. Journal of Vision, 10, 10–17.
Freiwald, W. A., & Tsao, D. Y. (2010). Functional compartmentalization and viewpoint generalization within the macaque face processing system. Science, 330, 845–851.
Greene, E., & Visani, A. (2015). Recognition of letters displayed as briefly flashed dot patterns. Attention, Perception & Psychophysics, 77, 1955–1969.
Gur, M. (2018). Very small faces are easily discriminated under long and short exposure times. Journal of Neurophysiology, 119, 1599–1607. doi:10.1152/jn.00622.2017
Gur, M. (2021). Psychophysical evidence and perceptual observations show that object recognition is not hierarchical but is a parallel, simultaneous, egalitarian, non-computational system. BioRxiv, 1–25. doi:10.1101/2021.06.10.447325
Keysers, C., Xiao, D.-K., Foldiak, P., & Perrett, D. I. (2001). The speed of light. Journal of Cognitive Neuroscience, 13, 90–101.
Marr, D. (2010). Vision. MIT Press.
Rolls, E. T. (2012). Invariant visual object and face recognition: Neural and computational bases, and a model, VisNet. Frontiers in Computational Neuroscience, 6, 1–70.
Schmolesky, M. T., Wang, Y., Hanes, D. P., Thompson, K. G., Leutgeb, S., Schall, J. D., & Leventhal, A. G. (1998). Signal timing across the macaque visual system. Journal of Neurophysiology, 79, 3272–3278.
Stettler, D. D., Das, A., Bennett, J., & Gilbert, C. D. (2002). Lateral connectivity and contextual interactions in macaque primary visual cortex. Neuron, 36, 739–750.
Ullman, S. (2019). Using neuroscience to develop artificial intelligence. Science, 363, 692–693.