
Kahn Fellowship: Shaping perception

I’m about to start a fellowship at the Kahn Institute at Smith College. It is part of a project called “Shaping Perception”. What I hope to accomplish in this fellowship is a complete first draft of a monograph presenting a model of speech perception.

As I’ve noted in an earlier post, the model is being developed in opposition to a direct realist model of speech perception, as developed by Carol Fowler and others. As also noted in that post, a direct realist model of perception shares the assumption of other extended cognition models that much cognitive activity takes place outside the head, in the world where perceived events occur. For speech sounds, direct realist and extended cognition models assume that the acoustics of the speech signal are information about the articulations that produced them. In the auditorist alternative that I espouse, the acoustic properties are not and cannot be informative when they’re out in the world, but only once they’re inside the perceiver’s head and have been transformed into auditory qualities that can be evaluated by a linguistically informed nervous system.

Because models of extended cognition, and likewise models of embodied cognition, share many assumptions with Husserl’s and Merleau-Ponty’s phenomenology, I expect to be spending a good part of this fellowship year getting a better understanding of their work, too.


Why speech perception is neither embodied nor extended cognition

In what sense can speech perception be considered an instance of either embodied or extended cognition? I suggest that it can be one or the other, or perhaps both, if the objects of speech perception, i.e. what the listener perceives, are the speaker’s articulatory gestures.

According to the motor theory of speech perception (Liberman et al., 1967; Liberman & Mattingly, 1985, 1989), listeners recognize speech sounds by emulating their articulation in their heads. The sound is recognized when the acoustic properties of the emulation match those of the speech sound heard. The perceived objects are the speaker’s articulatory intentions, in the form of neural instructions to the articulators, not the articulations themselves, because the latter vary as much as the acoustic properties do as a function of coarticulation with neighboring sounds. Although the motor theory had a variety of motivations, its principal impetus was the need to explain the fact that listeners perceive acoustically very different sounds as the same speech sound. Perhaps the most famous example of such perceptual invariance in the face of acoustic variability is that listeners perceive both the synthetic syllables whose spectrograms are displayed below as beginning with [d]:



Here are recreations of [di] and [du]. Do they sound like they begin with the same consonant? (If they don’t, that could be because I had to use modern technology to produce them, rather than the Pattern Playback, which was used to produce the syllables whose spectrograms are displayed here.)

The first formants (F1s) of these syllables are the same, but they only convey that both [i] and [u] are close (= high) vowels and not that the preceding consonant is [d]. The consonant’s identity is instead conveyed by the initial changes (transitions) in the second formant’s frequency (F2). These transitions are obviously determined by F2’s values in the following vowel steady-states (the horizontal stretches when the formant doesn’t change value), but they don’t resemble one another in any way. The complete lack of resemblance of the two syllables’ F2 trajectories shows how extensive the acoustic variability caused by coarticulation can be. Producing the close back rounded vowel [u] lowers F2 not only during the vowel itself but also during the transition from the preceding [d], which therefore starts substantially lower than it does before [i], where the close front unrounded articulation of [i] instead pulls the F2 onset and transition upward toward its high steady-state value. So it’s a surprise that these syllables sound like they begin with the same consonant.

The surprise disappears once we recognize that both consonants are articulated with the tongue tip and blade at the alveolar ridge, i.e. with apparently the same articulatory gesture. But how can listeners recognize that similarity? They use their ears, after all, to recognize speech sounds, and not x-ray machines that would let them see through the speaker’s cheeks. The motor theory’s answer is that they can do so through the emulation of the speaker’s articulations mentioned above. The emulation could be conceived as a mental model of the speaker’s vocal apparatus that can be manipulated to reproduce the speaker’s articulations and their acoustic consequences, which can be matched to the acoustic properties of the speech sound produced in the world.
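
To make the idea of emulation a little more concrete, here is a minimal toy sketch in Python. It is only an illustration of matching by emulation, not the motor theory’s actual mechanism or anything taken from the cited papers. Every name in it (`VOWEL_F2`, `GESTURE_LOCUS`, `COARTICULATION`, `synthesize_f2_onset`, `emulate`) is hypothetical, and every number is an invented round figure chosen only so that F2 starts high and rises before [i] and starts low and falls before [u]. I also assume, for simplicity, that the vowel has already been identified.

```python
# Toy sketch of "perception by emulation". All values are illustrative
# assumptions, not measurements from the post or from the literature.

# Hypothetical F2 steady-state targets (Hz) for the two vowels.
VOWEL_F2 = {"i": 2600.0, "u": 700.0}

# Hypothetical F2 value (Hz) that each consonant gesture would produce
# if there were no coarticulation with the following vowel.
GESTURE_LOCUS = {"labial": 800.0, "alveolar": 1800.0, "velar": 3000.0}

# Assumed strength of coarticulation: 0 puts the F2 onset at the
# consonant's own value, 1 puts it at the vowel's steady-state value.
COARTICULATION = 0.5


def synthesize_f2_onset(gesture: str, vowel: str) -> float:
    """Forward model: predict the F2 onset of a gesture before a vowel."""
    locus = GESTURE_LOCUS[gesture]
    target = VOWEL_F2[vowel]
    return locus + COARTICULATION * (target - locus)


def emulate(heard_f2_onset: float, vowel: str) -> str:
    """Return the gesture whose emulated F2 onset best matches what was heard."""
    return min(
        GESTURE_LOCUS,
        key=lambda g: abs(synthesize_f2_onset(g, vowel) - heard_f2_onset),
    )


if __name__ == "__main__":
    # Two acoustically very different onsets map back to one gesture.
    print(emulate(2200.0, "i"))
    print(emulate(1250.0, "u"))
```

In this toy, the two heard F2 onsets differ by nearly 1000 Hz, yet the emulation attributes both to the same hypothetical gesture, because the forward model already knows how the following vowel shifts the onset.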

Now, this is embodied cognition in that the cognitive act of perceiving a speech sound consists of emulating the speaker’s articulatory behavior mentally, i.e. behaving in the mind as the speaker would in the world.

According to the direct realist theory of speech perception (Fowler, 2006), neither such emulation nor embodied cognition is necessary to perceive speech sounds, because the speaker’s articulatory behavior in pronouncing them so structures the acoustic properties of the speech signal that the listener can identify the unique articulation that produced those properties. In other words, the acoustic properties are specific information about the articulations that produced them. To say that the acoustic properties are information about what the speaker said is of course innocuous. Direct realism makes a stronger claim: the array of acoustic properties uniquely specifies the articulatory gestures that produced it. If this is so, then the apparent perturbations of the acoustic properties of, say, a [d] caused by coarticulation with a following vowel become information about that coarticulation and the vowel’s identity. The signal can then be parsed into those acoustic properties that can be attributed to one articulation and those that can be attributed to another or others.

This is extended cognition because no mental act is required to extract information from the speech signal’s acoustic properties. The information is instead present and patent within those properties and no analysis is required to turn those properties into information. All the cognitive work, other than simply being around to hear the sounds, is done outside the listener’s head.

In my next post, I’ll take up the empirical and theoretical challenges to both the motor theory’s and direct realism’s accounts of speech perception, and thereby explain the “neither” in the title of this post.

Fowler, C. A. (2006). Compensation for coarticulation reflects gesture perception, not spectral contrast. Perception & Psychophysics. 68, 161-177.

Liberman, A.M., Cooper, F.S., Shankweiler, D.P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review. 74, 431-461.

Liberman, A.M., & Mattingly, I.G. (1985). The motor theory of speech perception revised. Cognition. 21, 1-36.

Liberman, A.M., & Mattingly, I.G. (1989). A specialization for speech perception. Science. 243, 489-494.

What’s out there?

I’ve been thinking a lot about what people perceive when they hear someone talking. Let’s set aside for another time what they perceive if they can see and are watching the speaker’s face, and just focus on what their ears tell them.

If you ask lay people what they perceive, or more precisely, “What did you just hear?” right after someone has finished speaking, they will almost certainly report back the words the person said. They may not remember them all, and they may confuse them with other words the person could have said, but what they’ll report is words. (By saying they report words, I don’t mean to imply that they don’t report the phrases and sentences composed of these words, but only use “words” to refer to any and all meaningful and individually identifiable constituents that listeners report hearing in the speaker’s utterance.)

The other thing they can reliably report is characteristics of who said them. If they recognize the voice, they’ll report the speaker’s name, but even if they don’t, they can still report characteristics of the speaker, e.g. their gender, perhaps their dialect, an impression of their body size, and their emotional state. Both the words and the speaker’s name and characteristics are somehow the salient characteristics of their behavior.

You may object to the “somehow” by retorting that the speaker’s words and characteristics must be the salient properties of their behavior, as they convey the content of the speaker’s utterance. The speaker’s words and characteristics are what the utterance is about, and it is to convey them that the speaker spoke. One could therefore think of the listener’s report as showing that they recognize and cooperate with the speaker’s intention. They may of course object to, misinterpret, and otherwise reject that content, but their report shows that they understand that the speaker intended to convey that content.

But that’s what this post is about: why are the reportable properties of the utterance limited to the speaker’s words and characteristics, or why are they limited to what the speaker’s utterance purports to be about? Why are listeners’ reports in this respect apparently so passive and accepting of what the speaker wished to convey? (As I acknowledged above already, they may reject that content and do so vociferously, but that rejection relies just as much on them attending to what words the speaker said and with what characteristics as if they slavishly accepted the speaker’s truth as second only to the deity’s.) And why isn’t everything else that characterizes the speaker’s behavior as a speaker also equally easy to report?

That is, what about all the rest of the properties of the utterance and the speaker’s behavior that listeners don’t report? They don’t report how many words they heard, what the initial sounds of those words are, how many syllables they consist of, or whether any of the words rhymed or alliterated with each other. I would imagine that they’d have a hard time reporting with confidence that the speaker’s third word began with a “p” or that it consisted of three syllables or, for that matter, that it was the third word. They also don’t report anything about the melody and rhythm of the utterance or where the pauses were, although they may report that particular words were emphasized, that the utterance was a statement or a question, or that it was spoken briskly without interruption or haltingly and with frequent disfluencies. Indeed, listeners seem to be largely unaware of, or incapable of noticing, the mistakes the speaker makes, whether these be speech errors or grammatical ones. (One piece of evidence of this is that students of speech errors admit that they have to decide to focus on listening for errors rather than listening to “what the person is saying” if they are to pick up on even a fraction of the errors a speaker makes.) Even gross disfluencies go largely unnoticed unless they make it hard to recognize the speaker’s words or their syntactic and semantic relationships to one another. But again, I think that these properties, too, are seldom noticed, or at least seldom reported or even reportable.

Sometimes, people do comment on rhymes, alliteration, and the like that they noticed in the speaker’s utterance, but that’s rare, and may be treated as a strange, perverse, or even annoying thing to do. That such comments elicit such reactions suggests strongly that there are some properties that we’re expected to notice when listening to speech and other properties that we’re not ordinarily expected to notice, unless of course the utterance is an instance of verbal art and both speaker and listener recognize it as such. But why aren’t we expected to notice all these other characteristics? Another way of putting this question is: why is special, indeed arduous, training necessary to impart to the listener the skills they would need to report these other characteristics consistently? Similarly, why is training a listener to transcribe accurately so hard?

What I’m getting at here is a version of the distinction between the medium and the message, and likewise the distinction between the form of the utterance and its purpose or function (these are overlapping but not coterminous distinctions). Listeners can report the message (the purpose or function of the utterance), which was called the “content” above, but seldom spontaneously or accurately report anything about the medium (or form).

The problem that lurks behind this distinction and the listener’s apparent failure or even inability to report much if anything about the utterance’s form is that the listener cannot hope to detect the speaker’s message (purpose/function) if they do not attend at some level to the form of the utterance that conveys that message. To be sure, the next word may be more or less predictable from what’s gone before (I will not get into the debate here about how much listeners can or do actually predict from context), but much of the time it will be enough less predictable that the listener actually has to listen to it, that is, perceive its form, to grasp the speaker’s intended message.

So if a listener must attend to the utterance’s form to extract its content successfully, why can they not reliably report any of those aspects of its form enumerated above? It brings to mind the Pirate King’s verse from Gilbert and Sullivan’s Pirates of Penzance,

A paradox,
A most ingenious paradox!
We’ve quips and quibbles heard in flocks,
But none to beat this paradox!

and Frederic’s reply:

How quaint the ways of Paradox!
At common sense she gaily mocks!

(Here of course we notice both the rhyme and meter, but then we’re supposed to, and even if we weren’t attending, Gilbert shoves it down our throats.)

The difference between what’s reliably reported and what typically cannot be may also underlie what have been referred to as the “public” aspects of speech perception in recent work by Fowler and other proponents of direct realist models of speech perception. As I understand what is referred to by “public”, it is that the listener perceives someone talking, and more narrowly that someone talking is someone who is moving their articulators, producing articulatory “gestures” in direct realist terms, and it is these gestures that are perceived. They are perceived because the acoustic signal produced by these gestures provides veridical information to the listener about what gestures produced it. (I will set aside for another post discussion of whether the relationship between the properties of that signal and the gestures that produced them is sufficiently invariant that those properties do provide the information necessary to invert the transformation from gestures to acoustics and reliably recover the gestures from the acoustics. Suffice it to say here that there are good reasons to be skeptical.)

The reader might object that the articulatory gestures are the form of the utterance, to which I’ve argued listeners apparently do not attend, at least to the extent that they can accurately report its characteristics. There is nonetheless a parallel here, one that is perhaps close enough that it merits exposure. Let’s again return to the original question, “What did you just hear?”, which is posed immediately after a speaker says something. As already discussed, the listener can report the speaker’s words and characteristics, what was referred to above as the utterance’s “content”. But they could also report that someone said something, rather than an owl hooting, a champagne glass crashing in the fireplace, or the wind howling through the chimney. Or more precisely, they would report that so-and-so “spoke”.

What is speaking? In lay terms, it is those movements of the various articulators that produce speech sounds. That is, it is the actions of the speaker that cause a series of speech sounds to be produced. These actions are the articulatory gestures referred to above.

This focus on causes is central. The content of the speaker’s utterance might be thought of as its first cause. To convey that content is what motivated, i.e. caused, the speaker to speak in the first place. The articulatory gestures are the medium in which that content is first physically realized. (I necessarily conflate here the neurophysiological means by which the speaker’s intended utterance is conveyed to the muscles, the contraction of those muscles, and the movements they bring about in the “articulatory gestures”.) Those gestures in turn cause the acoustic signal that conveys that content through the air to the listener’s ear.

So what listeners can report, apparently, is the causes of their auditory experience when someone speaks: both the initial and initiating cause, the utterance’s content, and a necessary intermediate cause, the articulatory gestures that made the content physical. But what they can’t or at least don’t readily report is the effects those causes bring about, particularly any of the myriad acoustic properties of the speech signal, the auditory qualities those properties evoke, their organization into sequences and grouping into prosodic units, their number, or any of a number of other properties that are just as physical and in a profound sense far more immediate and certainly closer to the listener than the speaker’s articulatory gestures ever could be, barring Helen Keller’s methods for recognizing the speaker’s message. Because they are closer, it is profoundly puzzling that listeners can say so little easily or reliably about them.

In my next post, I’ll begin to lay out a possible solution to this puzzle.