The Mirror System Hypothesis (MSH), described in Chapter 1, asserts that recognition of manual actions may ground the evolution of the language-ready brain. More specifically, the hypothesis suggests that manual praxic actions provide the basis for the successive evolution of pantomime, then protosign and protospeech, and finally the articulatory actions (of hands, face and – most importantly for speech – voice) that define the phonology of language. But whereas a praxic action just is a praxic action, a communicative action (which is usually a compound of meaningless articulatory actions; see Goldstein, Byrd, and Saltzman, this volume, on duality of patterning) is about something else. We want to give an account of that relationship between the sign and the signified (Arbib, this volume, Section 1.4.3).
Words and sentences can be about many things, concrete or abstract, or can have social import within a variety of speech acts. Here, however, we focus our discussion on two specific tasks of language in relation to a visually perceptible scene: (1) generating a description of the scene, and (2) answering a question about the scene. At one level, vision appears to be highly parallel, whereas producing or understanding a sentence appears to be essentially serial. Yet in each case there is both low-level parallel processing (across the spatial dimension in vision, across the frequency spectrum in audition) and high-level seriality in time (a sequence of visual fixations or foci of attention in vision, a sequence of words in language).