
Why speech perception is neither embodied nor extended cognition

In what sense can speech perception be considered an instance of either embodied or extended cognition? I suggest that it can be one or the other, or perhaps both, if the objects of speech perception, i.e. what the listener perceives, are the speaker’s articulatory gestures.

According to the motor theory of speech perception (Liberman et al., 1967; Liberman & Mattingly, 1985, 1989), listeners recognize speech sounds by emulating their articulation in their heads. The sound is recognized when the acoustic properties of the emulation match those of the speech sound heard. The perceived objects are the speaker’s articulatory intentions, in the form of neural instructions to the articulators, not the articulations themselves, because the latter vary as much as the acoustic properties do as a function of coarticulation with neighboring sounds. Although the motor theory had a variety of motivations, its principal impetus was the need to explain the fact that listeners perceive acoustically very different sounds as the same speech sound. Perhaps the most famous example of such perceptual invariance in the face of acoustic variability is that listeners perceive both of the synthetic syllables whose spectrograms are displayed below as beginning with [d]:

[Spectrograms of the synthetic syllables [di] and [du]]

Here are recreations of [di] and [du]. Do they sound like they begin with the same consonant? (If they don’t, that could be because I had to use modern technology to produce them, rather than the Pattern Playback, which was used to produce the syllables whose spectrograms are displayed above.)

The first formants (F1s) of these syllables are the same, but they only convey that both [i] and [u] are close (= high) vowels, not that the preceding consonant is [d]. The consonant’s identity is instead conveyed by the initial changes (transitions) in the second formant’s frequency (F2). These transitions are obviously determined by F2’s values in the following vowels’ steady states (the horizontal stretches where the formant doesn’t change value), but they don’t resemble one another in any way. The complete lack of resemblance between the two syllables’ F2 trajectories shows how extensive the acoustic variability caused by coarticulation can be. Producing the close back rounded vowel [u] lowers F2 not only during the vowel itself but during the transition from the preceding [d], which therefore starts substantially lower than it does before [i], where the close front unrounded articulation of [i] instead pulls the F2 onset and transition upward toward its high steady-state value. So it’s a surprise that these syllables sound like they begin with the same consonant.
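
For readers who want to get a feel for these patterns, here is a minimal sketch (in Python) of a sinewave-style recreation of the two F1/F2 trajectories. It is only an illustration: it is not the Pattern Playback, and the formant values are rough, invented approximations of the classic patterns, not measurements of the stimuli above.

```python
# A rough sinewave-style recreation of the [di]/[du] formant patterns.
# The formant values below are approximate, illustrative figures, not
# measurements of the original Pattern Playback stimuli.
import numpy as np
from scipy.io import wavfile

FS = 16000    # sample rate (Hz)
DUR = 0.30    # syllable duration (s)
TRANS = 0.05  # duration of the formant transition (s)

def formant_track(onset_hz, steady_hz, dur=DUR, trans=TRANS, fs=FS):
    """A linear transition from onset to steady state, then a flat stretch."""
    n_trans = int(trans * fs)
    n_steady = int((dur - trans) * fs)
    return np.concatenate([np.linspace(onset_hz, steady_hz, n_trans),
                           np.full(n_steady, steady_hz)])

def sinewave_syllable(f1_track, f2_track):
    """Replace each formant with a sinusoid that follows its frequency track."""
    tone = lambda f: np.sin(2 * np.pi * np.cumsum(f) / FS)
    sig = 1.0 * tone(f1_track) + 0.5 * tone(f2_track)
    sig *= np.hanning(len(sig))  # fade in and out to avoid clicks
    return (sig / np.max(np.abs(sig)) * 0.9 * 32767).astype(np.int16)

# F1 behaves the same in both syllables; F2 rises toward [i]'s high steady
# state in [di] but falls toward [u]'s low steady state in [du].
di = sinewave_syllable(formant_track(250, 280), formant_track(2200, 2600))
du = sinewave_syllable(formant_track(250, 280), formant_track(1200, 800))

wavfile.write("di_sketch.wav", FS, di)
wavfile.write("du_sketch.wav", FS, du)
```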

The surprise disappears once we recognize that both consonants are articulated with the tongue tip and blade at the alveolar ridge, i.e. with apparently the same articulatory gesture. But how can listeners recognize that similarity? They use their ears after all to recognize speech sounds, and not x-ray machines that would let them see through the speaker’s cheeks. The motor theory’s answer is that they can do so through the emulation of the speaker’s articulations mentioned above. The emulation could be conceived as a mental model of the speaker’s vocal apparatus that can be manipulated to reproduce the speaker’s articulations and their acoustic consequences, which can be matched to the acoustic properties of the speech sound produced in the world.

Now, this is embodied cognition in that the cognitive act of perceiving a speech sound consists of emulating the speaker’s articulatory behavior mentally, i.e. behaving in the mind as the speaker would in the world.

According to the direct realist theory of speech perception (Fowler, 2006), no such emulation, and no embodied cognition, is necessary to perceive speech sounds, because the speaker’s articulatory behavior in pronouncing them so structures the acoustic properties of the speech signal that the listener can identify the unique articulation that produced those properties. In other words, the acoustic properties are specific information about the articulations that produced them. To say that the acoustic properties are information about what the speaker said is of course innocuous. Direct realism makes a stronger claim: the array of acoustic properties uniquely specifies the articulatory gestures that produced it. If this is so, then the apparent perturbations of the acoustic properties of, say, a [d] caused by coarticulation with a following vowel become information about that coarticulation and the vowel’s identity. The signal can then be parsed into those acoustic properties that can be attributed to one articulation and those that can be attributed to another or others.

This is extended cognition because no mental act is required to extract information from the speech signal’s acoustic properties. The information is instead present and patent within those properties and no analysis is required to turn those properties into information. All the cognitive work, other than simply being around to hear the sounds, is done outside the listener’s head.

In my next post, I’ll take up the empirical and theoretical challenges to both the motor theory’s and direct realism’s accounts of speech perception, and thereby explain the “neither” in the title of this post.

Fowler, C. A. (2006). Compensation for coarticulation reflects gesture perception, not spectral contrast. Perception & Psychophysics, 68, 161-177.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.

Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1-36.

Liberman, A. M., & Mattingly, I. G. (1989). A specialization for speech perception. Science, 243, 489-494.

What’s out there?

I’ve been thinking a lot about what people perceive when they hear someone talking. Let’s set aside for another time what they perceive if they can see and are watching the speaker’s face, and just focus on what their ears tell them.

If you ask lay people what they perceive, or more precisely, “What did you just hear?” right after someone has finished speaking, they will almost certainly report back the words the person said. They may not remember them all, and they may confuse them with other words the person could have said, but what they’ll report is words. (By saying they report words, I don’t mean to imply that they don’t also report the phrases and sentences composed of those words; I simply use “words” to refer to any and all meaningful, individually identifiable constituents that listeners report hearing in the speaker’s utterance.)

The other thing they can reliably report is characteristics of the person who said them. If they recognize the voice, they’ll report the speaker’s name, but even if they don’t, they can still report characteristics of the speaker, e.g. their gender, perhaps their dialect, an impression of their body size, and their emotional state. Both the words and the speaker’s name and characteristics are somehow the salient properties of their behavior.

You may object to the “somehow” by retorting that the speaker’s words and characteristics must be the salient properties of their behavior, as they convey the content of the speaker’s utterance. The speaker’s words and characteristics are what the utterance is about, and it is to convey them that the speaker uttered them. One could therefore think of the listener’s report as showing that they recognize and cooperate with the speaker’s intention. They may of course object to, misinterpret, or otherwise reject that content, but their report shows that they understand that the speaker intended to convey it.

But that’s what this post is about: why are the reportable properties of the utterance limited to the speaker’s words and characteristics, or, put another way, why are they limited to what the speaker’s utterance purports to be about? Why are listeners’ reports in this respect apparently so passive and accepting of what the speaker wished to convey? (As I acknowledged above, they may reject that content and do so vociferously, but that rejection relies just as much on their attending to what words the speaker said and with what characteristics as if they slavishly accepted the speaker’s truth as second only to the deity’s.) And why isn’t everything else that characterizes the speaker’s behavior as a speaker equally easy to report?

That is, what about all the rest of the properties of the utterance and the speaker’s behavior that listeners don’t report? They don’t report how many words they heard, what the initial sounds of those words were, how many syllables they consisted of, or whether any of the words rhymed or alliterated with one another. I would imagine that they’d have a hard time reporting with confidence that the speaker’s third word began with a “p” or that it consisted of three syllables, or for that matter that it was the third word. They also don’t report anything about the melody and rhythm of the utterance or where the pauses were, although they may report that particular words were emphasized, that the utterance was a statement or a question, or that it was spoken briskly without interruption or haltingly and with frequent disfluencies. Indeed, listeners seem to be largely unaware of or incapable of noticing the mistakes the speaker makes, whether these are speech errors or grammatical ones. (One piece of evidence of this is that students of speech errors admit that they have to decide to focus on listening for errors rather than listening to “what the person is saying” if they are to pick up on even a fraction of the errors a speaker makes.) Even gross disfluencies go largely unnoticed unless they make it hard to recognize the speaker’s words or their syntactic and semantic relationships to one another. But again, I think that these properties, too, are seldom noticed, or at least seldom reported or even reportable.

Sometimes people do comment on rhymes, alliteration, and the like that they noticed in the speaker’s utterance, but that’s rare, and it may be treated as a strange, perverse, or even annoying thing to do. That such comments elicit such reactions suggests strongly that there are some properties we’re expected to notice when listening to speech and other properties we’re not ordinarily expected to notice, unless of course the utterance is an instance of verbal art and both speaker and listener recognize it as such. But why aren’t we expected to notice all these other characteristics? Another way of putting this question is: why is special, indeed arduous, training necessary to impart to the listener the skills they would need to report these other characteristics consistently? Similarly, why is it so hard to train a listener to transcribe accurately?

What I’m getting at here is a version of the distinction between the medium and the message, and likewise the distinction between the form of the utterance and its purpose or function (these are overlapping but not coterminous distinctions). Listeners can report the message (the purpose or function of the utterance), which was called the “content” above, but seldom spontaneously or accurately report anything about the medium (or form).

The problem that lurks behind this distinction, and behind the listener’s apparent failure or even inability to report much if anything about the utterance’s form, is that the listener cannot hope to detect the speaker’s message (purpose/function) if they do not attend at some level to the form of the utterance that conveys that message. To be sure, the next word may be more or less predictable from what’s gone before (I will not get into the debate here about how much listeners can or do actually predict from context), but much of the time it will be unpredictable enough that the listener actually has to listen to it, that is, perceive its form, to grasp the speaker’s intended message.

So if a listener must attend to the utterance’s form to extract its content successfully, why can they not reliably report any of those aspects of its form enumerated above? It brings to mind the Pirate King’s verse from Gilbert and Sullivan’s Pirates of Penzance,

A paradox,
A most ingenious paradox!
We’ve quips and quibbles heard in flocks,
But none to beat this paradox!

and Frederic’s reply:

How quaint the ways of Paradox!
At common sense she gaily mocks!

(Here, of course, we notice both the rhyme and the meter, but then we’re supposed to, and even if we weren’t attending, Gilbert shoves them down our throats.)

The difference between what’s reliably reported and what typically cannot be may also underlie what have been referred to as the “public” aspects of speech perception in recent work by Fowler and other proponents of direct realist models of speech perception. As I understand it, what is referred to by “public” is that the listener perceives someone talking, and more narrowly that someone talking is someone who is moving their articulators, producing articulatory “gestures” in direct realist terms, and it is these gestures that are perceived. They are perceived because the acoustic signal produced by these gestures provides veridical information to the listener about which gestures produced it. (I will set aside for another post discussion of whether the relationship between the properties of that signal and the gestures that produced them is sufficiently invariant for those properties to provide the information necessary to invert the transformation from gestures to acoustics and reliably recover the gestures from the acoustics. Suffice it to say here that there are good reasons to be skeptical.)

The reader might object that the articulatory gestures are the form of the utterance, to which I’ve argued listeners apparently do not attend, at least not to the extent that they can accurately report its characteristics. There is nonetheless a parallel here, one that is perhaps close enough that it merits exposure. Let’s again return to the original question, “What did you just hear?” which is posed immediately after a speaker says something. As already discussed, the listener can report the speaker’s words and characteristics, what was referred to above as the utterance’s “content”. But they could also report that someone said something, rather than an owl hooting, a champagne glass crashing in the fireplace, or the wind howling through the chimney. Or more precisely, they would report that so-and-so “spoke”.

What is speaking? In lay terms, it is those movements of the various articulators that produce speech sounds. That is, it is the actions of the speaker that cause a series of speech sounds to be produced. These actions are the articulatory gestures referred to above.

This focus on causes is central. The content of the speaker’s utterance might be thought of as its first cause: to convey it is what motivated, i.e. caused, the speaker to speak in the first place. The articulatory gestures are the medium in which that content is first physically realized. (I necessarily conflate here, under “articulatory gestures”, the neurophysiological means by which the speaker’s intended utterance is conveyed to the muscles, the contraction of those muscles, and the movements they bring about.) Those gestures in turn cause the acoustic signal that conveys that content through the air to the listener’s ear.

So what listeners can report, apparently, is the causes of their auditory experience when someone speaks: both the initial and initiating cause, the utterance’s content, and a necessary intermediate cause, the gestures that made that content physical. But what they can’t, or at least don’t readily, report is the effects those causes bring about, particularly any of the myriad acoustic properties of the speech signal, the auditory qualities those properties evoke, their organization into sequences and grouping into prosodic units, their number, or any of a number of other properties that are just as physical and, in a profound sense, far more immediate and certainly closer to the listener than the speaker’s articulatory gestures ever could be, barring Helen Keller’s methods for recognizing the speaker’s message. Because they are closer, it is profoundly puzzling that listeners can say so little easily or reliably about them.

In my next post, I’ll begin to lay out a possible solution to this puzzle.

Keeping words out of your ears

This is the first post of what I hope will be a regular series of posts on topics in phonetics, phonology, linguistics, and any of the many things that I can connect to these topics. At times, I will indulge in polemic, but for the most part my purpose is to write informally about what I’m thinking about these topics. Comments and competing polemics are welcome!

Lately, I’ve been trying to work out how best to follow up experiments in which we’ve pitted the listeners’ application of their linguistic knowledge against an auditory process that may be linguistically naive.

The auditory process produces perceptual contrast between the target sound and a neighboring sound, its context. (See Lotto & Kluender, 1998, in Perception & Psychophysics for the first demonstration of such effects, and Lotto & Holt, 2006, also in P&P, for a general discussion of contrast effects. Contrast is the auditory alternative to compensation for coarticulation; see Fowler, 2006, for discussion and arguments against contrast as a perceptual effect. I’ll come back to the contrast versus compensation debate in future posts. For the time being, it’s enough that the context causes the target sound to sound different from the context. I’ll describe that effect as “contrast,” but it could also be described as “compensation for coarticulation.”)

For example, we have shown that listeners are more likely to respond “p” to a stop from a [p-t] continuum following [i] than following [u]. They do so because [p] ordinarily concentrates energy at much lower frequencies in the spectrum than [t] does, while [i] concentrates it at much higher frequencies than [u] does. Thus, a stop whose energy concentration is halfway between [t]’s high value and [p]’s low one will sound lower, i.e. more like [p], next to a sound like [i] that concentrates energy at high frequencies (and higher, i.e. more like [t], next to a sound like [u] that concentrates energy at low frequencies).
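
To make the arithmetic of that contrast logic concrete, here is a toy sketch. The frequency values, the shift constant, and the subtraction rule are all invented for illustration; this is not any published model of contrast, just the bare logic of the preceding paragraph.

```python
# A toy illustration of the contrast logic: the percept of the stop's
# spectral energy concentration is pushed away from the context vowel's.
# All frequency values and the shift constant K are invented for illustration.

P_CENTROID, T_CENTROID = 1000.0, 4000.0  # rough energy concentrations (Hz)
VOWELS = {"i": 2700.0, "u": 800.0}       # rough vowel energy concentrations (Hz)
K = 0.2                                  # strength of the contrastive shift

def perceived(stop_hz, vowel):
    """Shift the stop's value away from the context vowel's value."""
    return stop_hz - K * (VOWELS[vowel] - stop_hz)

def categorize(stop_hz, vowel):
    boundary = (P_CENTROID + T_CENTROID) / 2
    return "p" if perceived(stop_hz, vowel) < boundary else "t"

midpoint = (P_CENTROID + T_CENTROID) / 2  # the ambiguous continuum step
print(categorize(midpoint, "i"))  # 'p': next to high-frequency [i] it sounds lower
print(categorize(midpoint, "u"))  # 't': next to low-frequency [u] it sounds higher
```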

The linguistic knowledge our experiments tested is knowledge of what’s a word. That knowledge can either cooperate with this auditory effect, as for example when the preceding context is “kee_” [ki_], where “p” but not “t” makes a word, keep, or it can conflict, as when the preceding context is “mee_” [mi_] instead, where “t” and not “p” makes a word, meet.

We describe both effects as “biases” and distinguish them as “contrast” versus “lexical” biases.  In these stimuli, the preceding [i] or [u] is the source of the contrast bias, while the consonant preceding that vowel is the source of the lexical bias.  (The lexical bias is also known in the literature as the “Ganong” effect, after William Ganong, who first described it in a 1980 paper in the Journal of Experimental Psychology: Human Perception and Performance.)

All of our experiments so far have used materials like these, where the context that creates the contrast bias and the one that creates the lexical bias both occur in the same syllable. (The order of the target sound, the context sound, and the sound that determines the lexical bias has been manipulated. If anyone wants to know, I can provide a full list of the stimuli.) Those experiments have shown that the two biases are effectively independent of one another.

Even so, we want to separate them in the stimuli by delaying the moment when the lexical bias determines what word it is, that is, by delaying the lexical uniqueness point. For example, the uniqueness point in the word rebate is the vowel following the [b] (compare rebound), and the uniqueness point in redress is likewise the segment following the [d] (compare reduce). So the listener would not know that these words are rebate or redress until at least one sound after the [b] or [d].
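
For concreteness, here is a minimal sketch of how uniqueness points can be computed. The four-word lexicon and the rough pseudo-phonemic transcriptions are toy examples; in practice one would run something like this over a full pronunciation dictionary such as CMUdict.

```python
# Find each word's lexical uniqueness point: the first segment at which its
# transcription diverges from every other word in the lexicon.
# The four-word lexicon and the rough transcriptions are toy examples.

LEXICON = {
    "rebate":  ["r", "ih", "b", "ey", "t"],
    "rebound": ["r", "ih", "b", "aw", "n", "d"],
    "redress": ["r", "ih", "d", "r", "eh", "s"],
    "reduce":  ["r", "ih", "d", "uw", "s"],
}

def uniqueness_point(word, lexicon):
    """Return the 1-based index of the first segment that rules out every
    other word in the lexicon (None if the word never becomes unique)."""
    target = lexicon[word]
    others = [p for w, p in lexicon.items() if w != word]
    for i in range(1, len(target) + 1):
        prefix = target[:i]
        if not any(p[:i] == prefix for p in others):
            return i
    return None

for word in LEXICON:
    i = uniqueness_point(word, LEXICON)
    print(word, "becomes unique at segment", i, "=", LEXICON[word][i - 1])
# rebate and rebound diverge only at the vowel after [b] (segment 4);
# redress and reduce diverge only at the segment after [d] (segment 4).
```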

The [b] in rebate would contrast perceptually with [i] in the first syllable, while the [d] in redress would not. Would this contrast effect make the listener more likely to hypothesize that the next sound is [b] rather than [d]? If so, how could we test it? Right now, we’re considering a phoneme monitoring experiment, where we measure how quickly the listener responds that a “b” or “d” occurs in these words. If contrast increases the expectation of a [b], then listeners should be faster to respond “yes” to rebate and slower to respond “no” to redress when the sound they’re monitoring is [b]. The opposite effect would be expected if the preceding sound were [u] rather than [i] because then the [d] and not the [b] would contrast.

An alternative is an eyetracking experiment, where we show the two words on the screen, play one of them, and measure the probability and latency of first fixations to the two words as a function of whether the context and target contrast.

A whole host of questions come up (which is largely the reason for this post):

  1. Will this work even though the target sounds are unambiguous? One reason to be hopeful that it would is that we have eyetracking data showing contrast effects with unambiguous sounds — I’ll be posting on these at another time.
  2. Is phoneme monitoring the right task?
  3. Getting more to the heart of the problem, is the uniqueness point late enough that we’d effectively separate the lexical bias from the contrast bias?
  4. It won’t surprise you to learn that the lexicon of English is not perfectly designed for the purposes of this experiment. Among other problems, it’s hard to find: (a) equal numbers of words with all the combinations of vowel and consonant place we want (vowels: front versus back; consonants: coronal versus labial), (b) as noted, words with uniqueness points that are late enough, (c) words that contrast minimally up through the target sound and its context, (d) lists that are reasonably well balanced for lexical statistics, (e) words that our likely participants, UMass-Amherst undergraduates, are likely to know, etc. The question here is: how much should any of this matter? Can’t we control these properties as best we can, while making sure we get enough items, and then include the possible confounding factors in the model of the results, as in the sketch below?
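
Here is a rough sketch of what that last option might look like. The data file and the column names (contrast_bias, lexical_bias, log_freq, neighborhood) are hypothetical placeholders, and a real analysis would also add by-subject and by-item random effects (e.g. a mixed logit model in R’s lme4); this is just the shape of the idea.

```python
# Sketch of the "control what we can, model the rest" option: a logistic
# regression of "p" responses on the two biases plus lexical covariates.
# The file name and column names are hypothetical placeholders; a fuller
# analysis would use a mixed logit model with subject and item random effects.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("responses.csv")  # one row per trial; resp_p coded 0/1

model = smf.logit(
    "resp_p ~ contrast_bias * lexical_bias + log_freq + neighborhood",
    data=df,
).fit()
print(model.summary())
```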