Research & Creative Activity

Indiana University

Office of Research and the University Graduate School
Volume XVII, Number 1, April 1994

Remembering Voices

"Train."

Hmm, was that Jill's voice? You press the key under her name and check the computer screen.

MARY, says the screen.

Darn, you keep getting Jill and Mary confused. But no time for regrets, because here comes the next word over the headphones.

"Quart," you hear in your earphones.

Tom, definitely Tom. His voice is so deep he could sing bass in a barbershop quartet. You press his name.

TOM, says the screen.

Yes! Now go for two in a row.

"------," the next voice says. You don't even recognize the actual word because you're concentrating so hard on the sound, and you're sure it's Carol because her voice sounds a little like an old friend from high school.

KAREN, says the screen.

Argh, you were sure that was Carol. Maybe that's why you're not friends anymore: she talked so much you never paid any attention to what she said.

You wait for the next word. But no more words are forthcoming, which means that your first day as a subject in David Pisoni and Lynne Nygaard's word recognition experiment has drawn to a close.

Pisoni, professor of psychology and cognitive science and director of Indiana University's Speech Research Laboratory, is one of the nation's foremost authorities on spoken language processing. "We are interested in how people perceive and comprehend spoken language," says Pisoni in his own distinctive New York accent. "This involves everything from the perception of phonemes [the sounds of a particular language] and syllables to word recognition, to what we call lexical access, or how people locate and retrieve the sound and meanings of words in memory, to sentence comprehension and spoken language understanding."

In some of his most recent experiments, like the one depicted above, Pisoni and his two colleagues, Nygaard and Mitch Sommers, a former postdoctoral student, have been trying to determine how a speaker's voice affects what is perceived by the listener and stored in memory.

"About two years ago we began to look in some detail at what you remember about a person's voice," Pisoni says. "This is a new direction because most of the research that has been done in speech perception the last forty years has not been concerned in any way with the individual characteristics of a speaker's voice."

Subjects in the experiment (usually undergraduates from introductory psychology courses) were trained over a period of ten days to recognize ten different voices, five male and five female. On the final day, subjects heard a brand-new series of words uttered by the same ten speakers to see whether they could identify the speaker's voice from novel utterances. Indeed they could, Pisoni and his colleagues found. They also found that when new words were presented amidst background noise, they were much more intelligible if they were spoken by one of the ten familiar voices than by an unfamiliar voice.

"These findings provide the first demonstration," write Pisoni and his colleagues, "that exposure to a talker's voice facilitates perceptual processing of the phonetic content of that speaker's novel utterances." This makes sense when you think about listening to, say, a new friend from Thailand. At first you might have trouble understanding some of the words. Eventually, though, you become familiar with the accent, and perception becomes much easier.

Pisoni believes that recognizing voices involves a type of memory called "procedural memory," or memory about how to do something. "This is a form of learning that is analogous to learning how to play an instrument, say a piano, or how to drive a car," he says. Except the car is someone else's voice, and instead of learning to press the gas pedal and turn the wheel, you are learning--in an unconscious way--to extract words from the flow of sound.

Along with computational procedures, Pisoni believes that you store other kinds of exquisitely detailed information about voices in memory: pitch, speaking rate, dialect, gender, emotional state, and other quirks and nuances. These so-called "indexical" or "personal" properties of speech are what help you recognize distinctive individuals speaking, as opposed to voices in the abstract. "This is what makes you Mark and me Dave," he says, "as opposed to you being a male talker and me being a male talker, or me being from New York and you being from the Midwest."

Until recently, most speech researchers had all but ignored these attributes of voices. Instead, they assumed that the mind somehow stripped away differences between talkers and stored words in memory in a very idealized, abstract way--freeze-dried coffee instead of the ripe Colombian bean.

Pisoni's work challenges this long-standing abstractionist view of speech processing. "These results," he says, "suggest that in fact neural representation of words in memory is much richer." And his work in voice recognition is tying in with recent research in other areas of perception and memory: "It is compatible with some recent findings that suggest that very fine episodic details of the stimulus environment are encoded by the nervous system."

Pisoni's research, then, is contributing to a better theoretical understanding of how perception and memory work, of the richness and complexity involved, for both speech and non-speech stimuli. But his interests extend beyond the purely theoretical. He is currently doing experiments to see if very young deaf children who have had cochlear implants (devices that restore partial hearing) encode and store voice details in memory the same way that people with ordinary hearing do. The results eventually may be helpful in improving the implants and developing better training for hearing-impaired children.

Pisoni is also on a quest for the Golden Voice. "We are interested in speakers who have Golden Voices, voices that are very intelligible," he says. What are the characteristics of such voices? "Well, nobody has investigated this problem. Why are some voices used on radio programs? Why are some voices intelligible and resistant to noise and why are some other voices easily degraded by background noise, and so on?" The answer may be that a Golden Voice is easy to encode perceptually; it may be a voice that is very similar to lots of other voices.

The work Pisoni is most famous for outside cognitive psychology circles involves the opposite of the golden voice: the intoxicated voice, slurring words, replacing "s" sounds with "sh" sounds, shpeaking shlower and shlower. Several months after the Exxon Valdez oil spill, an investigator from the National Transportation Safety Board called Pisoni to see if he could determine through speech analysis whether Joseph Hazelwood, the captain of the Exxon Valdez, was intoxicated at the time of the accident.

Using sophisticated digital signal processing technology, Pisoni and two colleagues, Keith Johnson and Bob Bernacki, found changes in Hazelwood's voice consistent with changes in the voices of intoxicated subjects in controlled laboratory experiments. Although this does not prove that Hazelwood was drunk, it certainly suggests as much, and Pisoni is currently serving as an expert witness in a civil lawsuit filed by victims of the accident. The case could break new legal ground, since voice analysis techniques have never before been admitted as evidence in a trial to determine if an individual was intoxicated.
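The article doesn't spell out which acoustic measures were used, but the general idea behind such an analysis can be sketched in miniature. The Python snippet below is a toy illustration only, not the lab's actual method: it computes the spectral centroid, one crude acoustic feature, for two pure tones standing in for an "s"-like sound (energy high in the spectrum) and an "sh"-like sound (energy lower down). A shift of energy toward lower frequencies, of the kind the "s"-to-"sh" slurring suggests, would show up as a drop in a feature like this.

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Center of mass of the magnitude spectrum, in Hz --
    one simple acoustic feature for comparing sounds."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate  # one second of samples

# Pure tones stand in for real fricatives here (an assumption for
# the sketch): "s" concentrates energy higher than "sh" does.
s_like = np.sin(2 * np.pi * 6000 * t)
sh_like = np.sin(2 * np.pi * 3000 * t)

print(spectral_centroid(s_like, sample_rate))   # near 6000 Hz
print(spectral_centroid(sh_like, sample_rate))  # near 3000 Hz
```

Real forensic speech analysis works with recorded speech, not sine waves, and combines many such measurements (speaking rate, pitch, segment durations) before drawing any comparison.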

Pisoni's interests are not restricted to the human voice, golden, intoxicated, or otherwise. For many years he has been studying the perception and comprehension of computer-generated voices. Although synthesis technology is getting better, computer voices still sound mechanical and unnatural, and he has shown that this synthetic quality not only makes words harder to recognize but also interferes with higher-order processes, such as understanding and remembering sentences.

In fact, it was a computer voice that interested him in speech perception in the first place. "When I was an undergraduate psychology major in New York City in 1966," he explains, "one of my professors played a tape of a talking computer. This was the same computer that was used to produce the voice of HAL in 2001, singing 'Daisy.'" Hooked, Pisoni went on to get his Ph.D. in psycholinguistics (a brand-new field at the time) from the University of Michigan and then did postdoctoral work at Yale and MIT. He came to Indiana University in 1971, the same year Bob Knight made his appearance on campus.

Pisoni has helped build a speech research program at IU whose stature rivals that of Knight's basketball program. The Speech Research Laboratory here is one of a small handful of laboratories around the country doing cutting-edge speech processing research. Thanks to numerous grants and contracts Pisoni has been awarded from the National Institutes of Health, the National Science Foundation, and the Air Force, the laboratory is outfitted with computers and analog-to-digital converters and other impressive equipment that looks like it belongs on the bridge of the Starship Enterprise.

And Pisoni has surrounded himself with a cadre of bright, enthusiastic graduate students and postdoctoral fellows who speak glowingly of the climate he has created in the laboratory. "It is a nurturing intellectual environment," says Nygaard. "He very much wants his people to do well and to learn." A sign on the wall of the laboratory captures Pisoni's approach: "Give intelligent people powerful tools and they'll work wonders."

--Mark Buechler

See sidebar on Tip of the Tongue