In what sense can speech perception be considered an instance of either embodied or extended cognition? I suggest that it can be one or the other, or perhaps both, if the objects of speech perception, i.e. what the listener perceives, are the speaker’s articulatory gestures.
According to the motor theory of speech perception (Liberman et al., 1967; Liberman & Mattingly, 1985, 1989), listeners recognize speech sounds by emulating their articulation in their heads. The sound is recognized when the acoustic properties of the emulation match those of the speech sound heard. The perceived objects are the speaker’s articulatory intentions, in the form of neural instructions to the articulators, not the articulations themselves, because the latter vary as much as the acoustic properties do as a function of coarticulation with neighboring sounds. Although the motor theory had a variety of motivations, its principal impetus was the need to explain the fact that listeners perceive acoustically very different sounds as the same speech sound. Perhaps the most famous example of such perceptual invariance in the face of acoustic variability is that listeners perceive both the synthetic syllables whose spectrograms are displayed below as beginning with [d]:
Here are recreations of [di] and [du]. Do they sound like they begin with the same consonant? (If they don’t, that could be because I had to use modern technology to produce them, rather than the Pattern Playback, which was used to produce the syllables whose spectrograms are displayed here.)
The first formants (F1s) of these syllables are the same, but they only convey that both [i] and [u] are close (= high) vowels and not that the preceding consonant is [d]. The consonant’s identity is instead conveyed by the initial changes (transitions) in the second formant’s frequency (F2). These transitions are obviously determined by F2’s values in the following vowel steady-states (the horizontal stretches when the formant doesn’t change value), but they don’t resemble one another in any way. The complete lack of resemblance between the two syllables’ F2 trajectories shows how extensive the acoustic variability caused by coarticulation can be. Producing the close back rounded vowel [u] lowers F2 not only during the vowel itself but also during the transition from the preceding [d], which therefore starts substantially lower than it does before [i], where the close front unrounded articulation of [i] instead pulls the F2 onset and transition upward toward its high steady-state value. So it’s a surprise that these syllables sound like they begin with the same consonant.
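To make the contrast concrete, here is a minimal Python sketch of the two F2 transitions. The frequencies and the 50 ms transition duration are round, illustrative numbers that I am assuming for the purpose of the example; they are not measurements taken from the spectrograms, and the function names are my own.

```python
# Illustrative F2 values in Hz. These are assumptions chosen for the
# example, not measurements from the spectrograms discussed in the post.
F2_ONSET_HZ = {"di": 2200.0, "du": 1250.0}
F2_STEADY_HZ = {"di": 2600.0, "du": 700.0}
TRANSITION_MS = 50  # assumed transition duration


def f2_track(syllable, step_ms=10):
    """Linearly interpolate F2 from its onset to the vowel steady state."""
    onset = F2_ONSET_HZ[syllable]
    steady = F2_STEADY_HZ[syllable]
    n_steps = TRANSITION_MS // step_ms
    return [onset + (steady - onset) * i / n_steps for i in range(n_steps + 1)]


def direction(track):
    """Say whether the transition rises or falls toward the steady state."""
    return "rising" if track[-1] > track[0] else "falling"


for syllable in ("di", "du"):
    track = f2_track(syllable)
    print(syllable, direction(track), [round(f) for f in track])
```

With these assumed values the [di] transition rises and the [du] transition falls, and the two occupy entirely different frequency ranges, even though both are meant to signal the same consonant.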
The surprise disappears once we recognize that both consonants are articulated with the tongue tip and blade at the alveolar ridge, i.e. with apparently the same articulatory gesture. But how can listeners recognize that similarity? They use their ears, after all, to recognize speech sounds, not x-ray machines that would let them see through the speaker’s cheeks. The motor theory’s answer is that they can do so through the emulation of the speaker’s articulations mentioned above. The emulation could be conceived as a mental model of the speaker’s vocal apparatus that can be manipulated to reproduce the speaker’s articulations and their acoustic consequences, which can be matched to the acoustic properties of the speech sound produced in the world.
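The matching step can be caricatured in a few lines of Python. This is only a toy sketch of emulate-and-compare, not a model that the motor theory's authors proposed: it assumes that each place of articulation has a single characteristic F2 "locus", that the F2 onset falls partway between that locus and the vowel's F2, and that the listener already knows the vowel. The locus values, the `UNDERSHOOT` factor, and the function names are all hypothetical choices made for illustration.

```python
# Hypothetical F2 loci in Hz for three places of articulation.
# Assumed, illustrative values only.
LOCUS_HZ = {"labial": 720.0, "alveolar": 1800.0, "velar": 3000.0}

# Assumed fraction of the distance from locus to vowel F2 that has
# already been covered at the F2 onset.
UNDERSHOOT = 0.5


def emulate_f2_onset(place, vowel_f2_hz):
    """Emulate a gesture: predict the F2 onset it would produce before this vowel."""
    locus = LOCUS_HZ[place]
    return locus + UNDERSHOOT * (vowel_f2_hz - locus)


def perceive_place(heard_onset_hz, vowel_f2_hz):
    """Pick the gesture whose emulated F2 onset best matches the one heard."""
    return min(
        LOCUS_HZ,
        key=lambda place: abs(emulate_f2_onset(place, vowel_f2_hz) - heard_onset_hz),
    )


print(perceive_place(2200.0, 2600.0))  # [di]-like input: alveolar
print(perceive_place(1250.0, 700.0))   # [du]-like input: alveolar
```

Under these assumptions, two very different F2 onsets are both traced back to the same alveolar gesture, because the emulation takes the following vowel into account when predicting what each gesture would sound like.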
Now, this is embodied cognition in that the cognitive act of perceiving a speech sound consists of emulating the speaker’s articulatory behavior mentally, i.e. behaving in the mind as the speaker would in the world.
According to the direct realist theory of speech perception (Fowler, 2006), neither such emulation nor embodied cognition is necessary to perceive speech sounds, because the speaker’s articulatory behavior in pronouncing them so structures the acoustic properties of the speech signal that the listener can identify the unique articulation that produced those properties. In other words, the acoustic properties are specific information about the articulations that produced them. To say that the acoustic properties are information about what the speaker said is of course innocuous. Direct realism makes a stronger claim: the array of acoustic properties uniquely specifies the articulatory gestures that produced it. If this is so, then the apparent perturbations of the acoustic properties of, say, a [d] caused by coarticulation with a following vowel become information about that coarticulation and the vowel’s identity. The signal can then be parsed into those acoustic properties that can be attributed to one articulation and those that can be attributed to another or others.
This is extended cognition because no mental act is required to extract information from the speech signal’s acoustic properties. The information is instead present and patent within those properties and no analysis is required to turn those properties into information. All the cognitive work, other than simply being around to hear the sounds, is done outside the listener’s head.
In my next post, I’ll take up the empirical and theoretical challenges to both the motor theory’s and direct realism’s accounts of speech perception, and thereby explain the “neither” in the title of this post.