Comments on Talker Normalization in Speech Perception

Author: David B. Pisoni

Introduction:
For many years, researchers working in the field of speech perception have assumed that the units derived from perceptual analysis of the speech signal are isomorphic with the abstract idealized units postulated in linguistic theory. In his well-known review chapter, Studdert-Kennedy (1974) makes this point quite explicitly:

The signal is a more or less continuously varying acoustic wave, .... The message is a string of lexical and grammatical items that may be transcribed as an appropriately marked sequence of discrete phonemic symbols ....

Central to this approach has always been the assumption of some kind of abstraction process which takes the time-varying physical signal and converts it into a symbolic representation that is equivalent to a linear sequence of phonemes. Until recently, few researchers working on human speech perception had any reason to question this approach. Most theoretical accounts of speech perception have postulated some type of perceptual normalization process to compensate for the many sources of variability in the speech signal. Unfortunately, it is precisely these sources of variability in speech that have prevented researchers from making much progress in solving what is often called the "primary recognition problem," that is, the problem of mapping invariant attributes of the physical signal onto abstract linguistic units.

According to traditional accounts of speech perception, the process of perceptual normalization involves a substantial reduction in information and transformation of the signal into a discrete symbolic representation. Thus, physically different tokens of the same word are made equivalent by removing irrelevant variability or "noise" from the speech signal. In the case of talker normalization, the topic of this session at the ATR workshop, it has been commonly assumed that the speech signal is "stripped" of its source characteristics. Detailed information about the talker is excluded from the symbolic representation. Only the linguistically distinctive features are preserved in memory in an idealized form.

Research from our laboratory over the last few years on the role of talker variability in speech perception demonstrates that this particular source of stimulus variability should not be thought of as just noise in the speech signal. Several of our experiments demonstrate that source characteristics -- detailed information about the talker -- become an integral part of the perceptual record and are encoded into long-term memory along with the more abstract symbolic representation derived from phonetic analysis. From these results, it appears that the representation of speech in memory is actually much richer and more detailed than is necessary for the linguist's description of the speech signal as an idealized sequence of meaningful units. It is possible that these more elaborate representations may provide important new clues to the pattern analyzing operations that human listeners use to map speech signals onto lexical representations in long-term memory. Thus, rather than being something that needs to be filtered out of the speech signal to extract the idealized linguistic message, stimulus variability may actually provide very useful information to the listener about aspects of the speech signal that are used for its perceptual analysis and subsequent encoding into memory.

In this commentary on talker normalization, I will briefly review two early studies that examined the role of talker variability in speech perception. Then, I will report on several recent findings from my research group at Indiana. More details about these experiments are provided in my ICSLP-90 plenary lecture (Pisoni, 1990).