I’ve been thinking a lot about what people perceive when they hear someone talking. Let’s set aside for another time what they perceive if they can see and are watching the speaker’s face, and just focus on what their ears tell them.
If you ask lay people what they perceive, or more precisely, “What did you just hear?” right after someone has finished speaking, they will almost certainly report back the words the person said. They may not remember them all, and they may confuse them with other words the person could have said, but what they’ll report is words. (By saying they report words, I don’t mean to imply that they don’t report the phrases and sentences composed of these words; I only use “words” to refer to any and all meaningful and individually identifiable constituents that listeners report hearing in the speaker’s utterance.)
The other thing they can reliably report is characteristics of who said them. If they recognize the voice, they’ll report the speaker’s name, but even if they don’t, they can still report characteristics of the speaker, e.g. their gender, perhaps their dialect, an impression of their body size, and their emotional state. Both the words and the speaker’s name and characteristics are somehow the salient characteristics of the speaker’s behavior.
You may object to the “somehow” by retorting that the speaker’s words and characteristics must be the salient properties of their behavior, as they convey the content of the speaker’s utterance. The speaker’s words and characteristics are what the utterance is about, and it is to convey them that the speaker spoke. One could therefore think of the listener’s report as showing that they recognize and cooperate with the speaker’s intention. They may of course object to, misinterpret, and otherwise reject that content, but their report shows that they understand that the speaker intended to convey that content.
But that’s what this post is about: why are the reportable properties of the utterance limited to the speaker’s words and characteristics, or why are they limited to what the speaker’s utterance purports to be about? Why are listeners’ reports in this respect apparently so passive and accepting of what the speaker wished to convey? (As I acknowledged above already, they may reject that content and do so vociferously, but that rejection relies just as much on them attending to what words the speaker said and with what characteristics as if they slavishly accepted the speaker’s truth as second only to the deity’s.) And why isn’t everything else that characterizes the speaker’s behavior as a speaker also equally easy to report?
That is, what about all the rest of the properties of the utterance and the speaker’s behavior that listeners don’t report? They don’t report how many words they heard, what the initial sounds of those words are, how many syllables they consist of, or whether any of the words rhymed or alliterated with each other. I would imagine that they’d have a hard time reporting with confidence that the speaker’s third word began with a “p” or that it consisted of three syllables or for that matter that it was the third word. They also don’t report anything about the melody and rhythm of the utterance or where the pauses were, although they may report that particular words were emphasized, that the utterance was a statement or a question, or that it was spoken briskly without interruption or haltingly and with frequent disfluencies. Indeed, listeners seem to be largely unaware or incapable of noticing the mistakes the speaker makes, whether these be speech errors or grammatical ones. (One piece of evidence of this is that students of speech errors admit that they have to decide to focus on listening for errors rather than listening to “what the person is saying” if they are to pick up on even a fraction of the errors a speaker makes.) Even gross disfluencies go largely unnoticed unless they make it hard to recognize the speaker’s words or their syntactic and semantic relationships to one another. But again, I think that these properties, too, are seldom noticed, or at least seldom reported or even reportable.
Sometimes, people do comment on rhymes, alliteration, and the like that they noticed in the speaker’s utterance, but that’s rare, and may be treated as a strange, perverse, or even annoying thing to do. That such comments elicit such reactions suggests strongly that there are some properties that we’re expected to notice when listening to speech and other properties that we’re not ordinarily expected to notice, unless of course the utterance is an instance of verbal art and both speaker and listener recognize it as such. But why aren’t we expected to notice all these other characteristics? Another way of putting this question is: why is special, indeed arduous, training necessary to impart to the listener the skills they would need to report these other characteristics consistently? Similarly, why is training a listener to transcribe accurately so hard?
What I’m getting at here is a version of the distinction between the medium and the message, and likewise the distinction between the form of the utterance and its purpose or function (these are overlapping but not coterminous distinctions). Listeners can report the message (the purpose or function of the utterance), which was called the “content” above, but seldom spontaneously or accurately report anything about the medium (or form).
The problem that lurks behind this distinction and the listener’s apparent failure or even inability to report much if anything about the utterance’s form is that the listener cannot hope to detect the speaker’s message (purpose/function) if they do not attend at some level to the form of the utterance that conveys that message. To be sure, the next word may be more or less predictable from what’s gone before (I will not get into the debate here about how much listeners can or do actually predict from context), but much of the time it will be enough less predictable that the listener actually has to listen to it, that is, perceive its form, to grasp the speaker’s intended message.
So if a listener must attend to the utterance’s form to extract its content successfully, why can they not reliably report any of those aspects of its form enumerated above? It brings to mind the Pirate King’s verse from Gilbert and Sullivan’s Pirates of Penzance,
A most ingenious paradox!
We’ve quips and quibbles heard in flocks,
But none to beat this paradox!
and Frederic’s reply:
How quaint the ways of Paradox!
At common sense she gaily mocks!
(Here, of course, we notice both the rhyme and the meter, but then we’re supposed to, and even if we weren’t attending, Gilbert shoves them down our throats.)
The difference between what’s reliably reported and what typically cannot be may also underlie what have been referred to as the “public” aspects of speech perception in recent work by Fowler and other proponents of direct realist models of speech perception. As I understand what is referred to by “public”, it is that the listener perceived someone talking, and more narrowly that someone talking is someone who is moving their articulators, producing articulatory “gestures” in direct realist terms, and it is these gestures that are perceived. They are perceived because the acoustic signal produced by these gestures provides veridical information to the listener about what gestures produced it. (I will set aside for another post discussion of whether the relationship between the properties of that signal and the gestures that produced them is sufficiently invariant that those properties do provide the information necessary to invert the transformation from gestures to acoustics and reliably recover the gestures from the acoustics. Suffice it to say here that there are good reasons to be skeptical.)
The reader might object that the articulatory gestures are the form of the utterance, to which I’ve argued listeners apparently do not attend, at least to the extent that they can accurately report its characteristics. There is nonetheless a parallel here, one that is perhaps close enough that it merits exposure. Let’s again return to the original question, “What did you just hear?” which is posed immediately after a speaker says something. As already discussed, the listener can report the speaker’s words and characteristics, what was referred to above as the utterance’s “content”. But they could also report that someone said something, rather than an owl hooting, a champagne glass crashing in the fireplace, or the wind howling through the chimney. Or more precisely, they would report that so-and-so “spoke”.
What is speaking? In lay terms, it is those movements of the various articulators that produce speech sounds. That is, it is the actions of the speaker that cause a series of speech sounds to be produced. These actions are the articulatory gestures referred to above.
This focus on causes is central. The content of the speaker’s utterance might be thought of as its first cause. To convey that content is what motivated, i.e. caused, the speaker to speak in the first place. The articulatory gestures are the medium in which that content is first physically realized. (I necessarily conflate here the neurophysiological means by which the speaker’s intended utterance is conveyed to the muscles, the contraction of those muscles, and the movements they bring about in the “articulatory gestures”.) Those gestures in turn cause the acoustic signal that conveys that content through the air to the listener’s ear.
So what listeners can report, apparently, is the causes of their auditory experience when someone speaks: both the initial and initiating cause, the utterance’s content, and a necessary intermediate cause, the articulatory gestures that made the content physical. But what they can’t or at least don’t readily report is the effects those causes bring about, particularly any of the myriad acoustic properties of the speech signal, the auditory qualities those properties evoke, their organization into sequences and grouping into prosodic units, their number, or any of a number of other properties that are just as physical and in a profound sense far more immediate and certainly closer to the listener than the speaker’s articulatory gestures ever could be, barring Helen Keller’s methods for recognizing the speaker’s message. Because they are closer, it is profoundly puzzling that listeners can say so little easily or reliably about them.
In my next post, I’ll begin to lay out a possible solution to this puzzle.