SOUND: A simple introduction to its production, perception, and representation
Cornelia Fales
Acoustic and Perceptual Worlds
It may be commonplace to point out that acoustic
reality and perceptual reality are different. In a live
performance situation, for example, no matter how still the
audience, the environment will be full of sounds extraneous to
the music. If a tape recorder were positioned somewhere in the
midst of such a situation, and if a segment of the resulting tape
were submitted to digital sound analysis, the results would
highlight the difference between what one heard during the
performance (what is presumably captured on the tape), and what
analysis confirms the tape actually contains. Sound analysis
reveals the behavior of sound in the physical world. In this
case, analysis would show that the soundwaves from all the sound
sources in the environment (the various instruments of the
performance, perhaps the stirring of the audience, or the sound
of vehicles passing beyond the confines of the performance
context) do not remain conveniently grouped by source. Rather,
the multitude of acoustic elements that make up each of these
sounds mix together, combining into a single, very complex
waveform which is represented on the tape and revealed through
analysis. This is because sound waves are additive, like waves
in water: where they overlap, their pressure variations sum
point by point into one more intricate wave rather than
persisting as separate waves.
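The additive mixing of sources can be sketched in a few lines of
Python. This is a minimal illustration rather than an analysis
tool: the sample rate, the two source frequencies, and the
function names are all assumptions chosen for the example.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for illustration)

def sine_wave(freq_hz, amplitude, n_samples, sample_rate=SAMPLE_RATE):
    """Return n_samples of a sine wave as a list of floats."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def mix(*sources):
    """Add several equal-length signals sample by sample."""
    return [sum(samples) for samples in zip(*sources)]

# Two hypothetical sources: an instrument tone and a quieter hum.
instrument = sine_wave(440.0, 1.0, 800)
hum = sine_wave(60.0, 0.3, 800)

# The recording holds only the sum; nothing in it marks which
# source contributed what to any given sample.
recording = mix(instrument, hum)
```

Each value in `recording` is a single number, which is why
separating the sources again requires further analysis or, for a
listener, perceptual organization.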
In the simplest possible terms, what digital
analysis uncovers are the acoustic features of the sounds
captured by the tape recorder; what are actually heard are the
perceptual features of the same sounds. The acoustic and
perceptual characteristics of sound are not the same, nor in many
cases is there a one-to-one correspondence between them.
Parameters of Sound
In a very general sense, sounds in a normal
environment consist of acoustic elements, each characterized
by a specific frequency, amplitude, and duration. In the
perceptual world, these parameters correspond to the sensations
of pitch, loudness, and existence over time, respectively. Again
in a general sense, sounds in the real world can be categorized
as periodic or aperiodic. Periodic sounds are those emitted by a
source that produces regular vibrations over time, resulting in a
collection of frequencies called harmonics, partials, or
overtones. Harmonic frequencies originating from the same source
are related in that they occur at integer multiples of the lowest
frequency, referred to as the fundamental frequency. Thus, a
collection of harmonically related frequencies with a
fundamental of 200 Hz would also contain frequencies of 400 Hz,
600 Hz, 800 Hz, 1000 Hz, 1200 Hz, etc.
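The arithmetic of a harmonic series is simple enough to state as
code. The function name below is an assumption made for this
sketch.

```python
def harmonic_series(fundamental_hz, count):
    """Return the first `count` harmonics, starting with the fundamental."""
    return [fundamental_hz * k for k in range(1, count + 1)]

print(harmonic_series(200, 6))
# prints [200, 400, 600, 800, 1000, 1200]
```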
Noise and Tone
Perceptually, a periodic sound can be defined by
the fact that it usually produces a distinct sensation of tonal
pitch; the sustained portion of many musical instruments and much
of speech consists of periodic sound. Aperiodic sounds are those
which most often occur perceptually as noise. Acoustically, noise
is defined as a random collection of frequencies from a single
source which are not harmonically related and whose waveform is
therefore irregular. Different kinds and sensations of noise can
be distinguished by the bandwidth (or frequency range) in which
the random frequencies of the noisy signal occur. Sporadic noise
is a component of most natural musical instruments, especially in
the attack phase; it has been shown, in fact, that the noisy
onset of an instrumental tone is very important, in some cases
necessary, for listener recognition of familiar instruments. In
speech, noise is a primary component of many consonants.
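One way to make the acoustic distinction concrete is to check
whether a waveform repeats itself. The sketch below compares a
signal with a copy of itself shifted by one period: a periodic
tone matches almost exactly, while random noise does not. The
sample rate, the 200 Hz tone, the uniform random noise, and the
function name are assumptions for illustration only.

```python
import math
import random

SAMPLE_RATE = 8000  # samples per second (assumed)

def repetition_error(signal, lag):
    """Mean absolute difference between a signal and itself shifted by `lag` samples."""
    diffs = [abs(a - b) for a, b in zip(signal, signal[lag:])]
    return sum(diffs) / len(diffs)

n = 1600

# Periodic tone: a 200 Hz fundamental plus its 2nd and 3rd harmonics.
tone = [sum(math.sin(2 * math.pi * 200 * k * i / SAMPLE_RATE) / k
            for k in (1, 2, 3))
        for i in range(n)]

# Aperiodic signal: uniformly distributed random samples.
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(n)]

period = SAMPLE_RATE // 200  # 40 samples per cycle of the fundamental

tone_error = repetition_error(tone, period)    # very close to zero
noise_error = repetition_error(noise, period)  # clearly above zero
```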
Sustained and Impulsive Sounds
In addition to noise (aperiodic) and tone
(periodic), another basic division of sounds is based on the
difference between sustained and impulsive stimulus of vibrating
material. An impulsive, or percussive, sound is one for which the
vibrating part of an instrument is excited discontinuously or in
pulses, so that with each excitation, a tone is produced and
immediately begins to decay until the next excitation starts the
process again. Common examples of impulsive sounds are those
produced from plucked or struck instruments, such as the guitar,
most percussion instruments, the piano, etc.
A sustained sound is one for which the vibrating
part of an instrument is excited continuously, so that the
sound continues in a more or less steady state as long as the
excitation continues. Instruments producing sustained tones are
those which are bowed or blown, such as bowed chordophones and
most aerophones.
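The difference between the two kinds of excitation shows up in
the amplitude envelope of the sound. The following is a
deliberately idealized sketch: the exponential decay, the decay
rate, the flat steady state, and both function names are
assumptions, and real instruments have more complicated onsets
and releases.

```python
import math

def impulsive_envelope(t, strike_times, decay_rate=4.0):
    """Amplitude at time t (seconds) for a struck or plucked source.

    Each strike resets the amplitude to 1.0, after which it decays
    exponentially (an assumed shape) until the next strike.
    """
    past = [s for s in strike_times if s <= t]
    if not past:
        return 0.0
    return math.exp(-decay_rate * (t - max(past)))

def sustained_envelope(t, start, stop):
    """Amplitude at time t for a bowed or blown source.

    Idealized as constant for as long as the excitation lasts.
    """
    return 1.0 if start <= t < stop else 0.0

strikes = [0.0, 1.0]  # two hypothetical excitations, one second apart
samples = [(t / 10, impulsive_envelope(t / 10, strikes)) for t in range(20)]
```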
Pitch, Timbre, and Vowel Quality
For periodic sounds such as those emitted by
musical instruments or the voice, the fundamental frequency - the
lowest harmonic - usually corresponds to the sensation of pitch.
When the members of an orchestra tune their instruments to a
standard A-440, they are each producing a soundwave of frequency
approximately 440 Hz. Since very often it is the fundamental
which is the loudest and lowest frequency, it was thought for a
long time that it was this frequency itself which
"sounded" the pitch above the frequencies of the other
harmonics. It is now understood that the fundamental does not
sound the pitch; rather pitch is determined by the repetition
rate of the entire waveform (the inverse of its pitch period),
which equals the frequency distance between any two consecutive
harmonics. It has been seen, in fact, that
the pitch sensation of a tone remains the same, even if the
fundamental or the fundamental and several of the lower harmonics
are removed from the tone.
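The persistence of pitch when the fundamental is removed can be
checked arithmetically. A sum of harmonics repeats at a rate
equal to the greatest common divisor of its component
frequencies, so deleting the lowest components leaves that rate
unchanged as long as consecutive harmonics remain. The sketch
below assumes whole-number frequencies in Hz; the function name
is hypothetical.

```python
from functools import reduce
from math import gcd

def repetition_rate_hz(harmonics_hz):
    """Rate at which the summed waveform repeats (integer Hz assumed)."""
    return reduce(gcd, harmonics_hz)

full_tone = [200, 400, 600, 800, 1000]
without_fundamental = [400, 600, 800, 1000]
without_lowest_two = [600, 800, 1000]

print(repetition_rate_hz(full_tone))            # prints 200
print(repetition_rate_hz(without_fundamental))  # prints 200
print(repetition_rate_hz(without_lowest_two))   # prints 200
```

This models only the acoustic periodicity of the signal; the
pitch sensation itself is a perceptual matter.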
If a tuning orchestra consists of a multitude of
instruments all theoretically producing notes of the same
frequency with the same harmonics (440 Hz, 880 Hz, 1320 Hz, 1760
Hz, etc.), what is it that distinguishes the sounds of the
instruments from each other? Or if two speakers produce different
vowels on the same pitch, producing the same harmonics, what is
it that distinguishes the sounds? In most musical instruments,
the action of the musician on the instrument produces a source
wave consisting of the vibrations of the vibrating element -- a
string, a column of air, or a membrane, depending on the
classification of the instrument. This source wave, like other
periodic waves, has a full complement of harmonics, and when
their vibrations are conveyed to the instrument's resonator, they
undergo what is called a transfer function, illustrated
schematically in the Glossary, under the definition for
"transfer function".
Central to the transfer function is the fact that
all resonators have certain characteristic resonances, or
frequencies at which they vibrate most efficiently. When the
source wave hits the resonator, therefore, the resonator's
characteristic resonances act to filter the harmonics of the
source wave. Source harmonics that are close in frequency to one
of the more efficient resonant frequencies of the resonator will
vibrate with greater energy, thus increasing in amplitude. The
transfer function changes none of the frequencies of the source
harmonics; rather it selectively amplifies the harmonics close to
its characteristic resonances. In a steady-state portion of a
tone, the sensation of timbre depends on the relative amplitude
of harmonics over time. Thus, if two instruments play a tone on
the same pitch, they produce all the same harmonics; the
difference between the two instruments is the pattern of
high-amplitude and low-amplitude harmonics.
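The filtering action of a resonator can be sketched as a gain
applied to each source harmonic. Everything specific here is an
assumption made for illustration: the bell-shaped gain curve, the
1/k fall-off of the source amplitudes, the resonance values, and
the function names. The point is only that frequencies pass
through unchanged while amplitudes are reshaped.

```python
import math

def resonator_gain(freq_hz, resonances):
    """Gain of a hypothetical resonator at one frequency.

    `resonances` is a list of (center_hz, bandwidth_hz, boost) tuples;
    each adds a bell-shaped boost around its center frequency.
    """
    gain = 1.0
    for center, bandwidth, boost in resonances:
        gain += boost * math.exp(-((freq_hz - center) / bandwidth) ** 2)
    return gain

def apply_transfer_function(source, resonances):
    """Scale each (frequency, amplitude) pair; frequencies are not altered."""
    return [(f, a * resonator_gain(f, resonances)) for f, a in source]

# Source wave: harmonics of 440 Hz with amplitudes falling off as 1/k.
source = [(440 * k, 1.0 / k) for k in range(1, 9)]

# Two made-up instruments that differ only in their resonances.
instrument_a = apply_transfer_function(source, [(900, 150, 4.0)])
instrument_b = apply_transfer_function(source, [(2600, 150, 4.0)])
```

Both results contain the same eight frequencies, but
`instrument_a` is strongest near 880 Hz and `instrument_b` near
2640 Hz, which is the kind of difference the text associates with
timbre.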
Speech
In speech, the functions of the lungs and vocal
cords correspond to the action of a musician on an instrument.
The pressure of air emitted from the lungs sets the cords into
motion, emitting "pulses" of air. These pulses set up a
periodic source wave consisting of a fundamental vibration and
its harmonics. When the vocal cords are tightened by the speaker,
the frequency of the pulses emitted by the cords increases, and
the fundamental and all its harmonics rise in pitch. Above the
vocal cords, the vocal tract can be thought of as a constantly
changing resonator, as the speaker alters the place and manner of
articulation with each phoneme. The source wave passes through
the vocal tract undergoing the same transfer function that occurs
with the resonator of a musical instrument.
Like the instrumental resonator, the vocal tract
acts as a selective filter, reinforcing the energy of some of the
source wave's frequencies, depending on the resonant frequencies
of the vocal tract in any particular conformation. Because of the
relative softness of the vocal tract as resonator, its resonant
frequencies are much broader in bandwidth (cover a wider range of
frequencies) than instrumental resonators constructed of wood or
metal. Therefore, rather than emphasizing one or several
individual harmonics as occurs with instrumental resonators, the
vocal tract emphasizes entire bands of harmonics, called
formants. The result is that each vowel sound has characteristic
formants consisting of bands of higher intensity harmonics.
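The effect of resonance bandwidth can be illustrated by counting
how many harmonics a single resonance boosts appreciably. The
bell-shaped resonance, the bandwidth values, the threshold, and
the function name are assumptions for this sketch; they are not
measurements of any real voice or instrument.

```python
import math

def boosted_harmonics(fundamental_hz, center_hz, bandwidth_hz,
                      n_harmonics=20, threshold=0.5):
    """List the harmonics whose relative boost exceeds `threshold`.

    The boost is modeled as a bell curve around `center_hz`
    (an assumed shape), equal to 1.0 at the center.
    """
    boosted = []
    for k in range(1, n_harmonics + 1):
        f = fundamental_hz * k
        boost = math.exp(-((f - center_hz) / bandwidth_hz) ** 2)
        if boost > threshold:
            boosted.append(f)
    return boosted

narrow = boosted_harmonics(100, 700, 40)   # hard resonator (assumed)
broad = boosted_harmonics(100, 700, 250)   # soft resonator (assumed)

print(narrow)  # prints [700]
print(broad)   # prints [500, 600, 700, 800, 900]
```

With the narrow resonance a single harmonic stands out; with the
broad one a whole band of neighboring harmonics is raised
together.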
Vocal v. Instrumental Timbre
In both music and vowel sounds, then, the
distinguishing quality between two sounds of the same pitch is
the timbre or tone quality. And in both music and vowel sounds,
timbre is dependent on the relative amplitudes of harmonics. As a
musical instrument, however, the voice differs from other
instruments in several ways, of which a few have already been
mentioned. The first of these is that the voice is an immensely
versatile instrument because of its variable resonator, whose
resonant frequencies change with the articulation of each vowel.
In fact, a strict analogy between the singing
voice and other musical instruments might insist that the voice
is comparable to a collection of instruments, whose pitch
ranges are the same, but whose timbres are different. However,
even if one considers each vowel to constitute a separate timbre
(defining a separate instrument), there are still several
characteristics of the voice as instrument that act not so much
to set it apart from all other musical instruments as to position
it at one end of a continuum of timbre characteristics. The
defining characteristics of the timbre continuum - those which
the voice possesses to a greater degree than other instruments -
revolve around the difference between a harmonic and a formant.
This distinction has been the source of a
longstanding argument in auditory research as to whether
perception of instrumental timbre depends on the existence of
formant frequencies or harmonic structure. Whereas harmonics are
individual frequencies that are perceptually fused together into
a unitary sensation of tone quality, formants are broader-band
regions of intense energy. Understood in this way, the argument
centers on two issues: 1) whether the salient perceptual features
of a tone consist of individual high-intensity harmonics or
high-intensity bands of harmonics; 2) whether the tone
quality of an instrument is uniform over its entire pitch range,
indicating that formant structure remains constant, regardless of
the harmonics that fill it.
The first of these issues concerns the bandwidth
of harmonics affected by the enhancing properties of the
resonator; as mentioned above, the softer the material from which
a resonator is constructed, the broader its resonant frequencies,
and the broader the band of harmonics that will be amplified.
Vocal timbre is characterized by wide formants due to the
softness of the vocal tract, while the timbre of a flute, for
example, whose resonator is comparatively hard, depends on
irregularly spaced, single harmonics emphasized over the others.
The second issue, the constancy of an instrument's timbral
quality over its pitch range, depends on the relationship, or
coupling, between the source vibration and resonator of the
instrument. As a vibrating system, the voice displays the loosest
coupling between the source and resonator of all instruments, and
is thus capable of the most independent variation of pitch and
timbre; tone languages, in particular, depend greatly on this
division of labor. A given vowel will maintain its timbral
quality no matter what pitch it is pronounced or sung on; a more
tightly coupled system, as is typical of many reed instruments,
for instance, is characterized by significant timbre variation
from its lowest to highest pitch. In addition to the voice, other
formanted, timbre-constant instruments are stringed instruments,
especially if constructed of softer wood, which might follow the
voice on the timbre continuum. At the other end of the continuum
might be instruments that depend on overblown harmonics for
certain pitches, like the flute and many reed instruments,
especially those made from harder wood or metal.
Whether or not one can speak of a true continuum
of timbre types, the general opinion at present is that some
instruments exhibit formant structure and some instruments depend
on the salience of individual harmonics. Though these
characteristics are purely acoustic in nature, and ought in
theory to be visibly measurable, they are musically important as
well. Those instruments that sound more "voicelike" are
inevitably positioned on the voice end of the continuum.
Traditional musicians appear fascinated with the quality of
formanted instruments: many "talking" instruments fall
into the formanted category, and most instruments that use
overtones to produce a second pitch above the drone of the
fundamental rely on the presence of formants from which a single
harmonic can be extracted.
Auditory Scene Analysis
As noted above, digital analysis of a recorded
live performance would show a conglomerate waveform
consisting of all frequencies at differing amplitudes from all
sources contained in the environment. This is identical to the
signal entering the ears of any listener sitting in the audience.
Unlike most systems of computer analysis of sound, which are
incapable of separating the complex waveform into its individual
source components, a human listener is able to parse or resolve
the signal, thus determining which sources have emitted sounds
contributing to that conglomerate. To do this, the auditory
processing mechanisms of the listener must first isolate from the
signal individual acoustic elements, in most cases single
frequencies, which are then recombined according to source. This
resolution will result in various source components, consisting
partly of periodic waves, each with a fundamental frequency and
appropriate harmonics, and partly of aperiodic waves,
arriving from noise-producing sources, perhaps extraneous to the
performance.
The separation of a conglomerate or complex
waveform into its individual elements and the recombination of
these elements into source-dependent units and then into sound
sensations is a process described by S. McAdams as the formation
of "auditory images" (1982), and A. Bregman as
"auditory scene analysis" (1990). Both of these phrases
convey something of the interpretive quality of the organization
process as McAdams and Bregman conceive it. Part of this
interpretive quality results from the fact that the listener must
engage in a "conversion process" that translates acoustic
features into perceptual features. The regrouping of
individual acoustic elements into "sound units" by the
central auditory system is thought to entail
"hypotheses" as to the nature of the source emitting
the signal to be processed; the importance of a source
hypothesis, formed on the evidence of incoming signals, is that
it can then direct a continuing search for acoustic evidence to
confirm the hypothesis. According to this theory of auditory
processing, if there are competing hypotheses, then the grouping
with the greatest amount of evidence for a possible source will
be chosen. It is the competition between possible percepts that
is responsible for the interpretive quality of the process, and
it is the final translation that allows the listener to identify
sources.
Because the possible combination of sounds that
may occur at any one time is virtually unlimited, the auditory
processing system has developed what appear to be principles of
organization to assist in hypothesis forming. Research has
uncovered an immense amount of information on the "cognitive
logic" that informs a source hypothesis. A generally
accepted tenet about the nature of this logic is that it appears
to depend on the characteristics of real sources in the acoustic
world. For example, real world sources often emit frequencies
that occur in harmonic relation, as described above. Similarly,
when real world sources emit consecutive tones over time,
these tones are often close together in frequency. If, therefore,
the auditory processing mechanism determines that among the
acoustic elements of an incoming signal, there are frequencies
that are harmonically related, or that there are tones close in
frequency that reoccur over time, then these two facts constitute
evidence for two groupings of elements from the signal.
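Grouping by harmonic relation can be sketched as a toy procedure.
It is not a model of the auditory system: the candidate
fundamentals are supplied in advance (a listener must infer
them), the tolerance is arbitrary, only the harmonicity cue is
used, and the function name is hypothetical.

```python
def group_by_harmonicity(frequencies_hz, candidate_fundamentals_hz,
                         tolerance_hz=2.0):
    """Assign each frequency to the first candidate fundamental of which
    it is (nearly) an integer multiple; collect the rest as ungrouped."""
    groups = {f0: [] for f0 in candidate_fundamentals_hz}
    ungrouped = []
    for f in sorted(frequencies_hz):
        for f0 in candidate_fundamentals_hz:
            k = round(f / f0)  # nearest harmonic number
            if k >= 1 and abs(f - k * f0) <= tolerance_hz:
                groups[f0].append(f)
                break
        else:
            # No candidate fundamental accounts for this frequency.
            ungrouped.append(f)
    return groups, ungrouped

# A made-up mixture: harmonics of 200 Hz and 330 Hz plus one stray element.
mixture = [200, 330, 400, 600, 660, 800, 990, 517]
groups, ungrouped = group_by_harmonicity(mixture, [200, 330])

print(groups)     # prints {200: [200, 400, 600, 800], 330: [330, 660, 990]}
print(ungrouped)  # prints [517]
```

A frequency that is a multiple of more than one candidate would
simply go to the first match here, whereas the theory described
above has competing groupings weighed by the evidence for each.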
If auditory perception in most situations is
guided by an inherent understanding of source characteristics in
the acoustic world, in a music performance situation, auditory
perception is notable for its frequent disregard of sources.
Efforts to blend the instruments of an ensemble, for example, act
to disrupt the normal source-orientation of auditory grouping.
Many genres of traditional music function by confounding sources,
encouraging the inclusion into a single perceived sound unit of
elements emitted by more than one source, or by provoking
listeners to hear the same combination of sound in two different
ways. Techniques such as these result in what Bregman has called
an auditory chimera (1990), or in more subtle
"anomalies" like the augmented timbres described by G.
Sandell (personal communication). Because musical sound may
disrupt the normal source-orientation of auditory processing, and
especially because the cognitive reasoning that replaces
source-orientation is often culturally conditioned, field
researchers ought ideally to investigate the actual perceptions
experienced by a listener. It is not enough for a researcher,
foreign to the music he/she is studying, to register his/her own
perceptions of the music under the assumption that there is only
one way to hear the sound components of the music; nor is it
enough to consider the acoustic information supplied by
digital analysis without having established the perception - both
indigenous and nonindigenous - that corresponds to the acoustic
elements uncovered by analysis.
Visual Representation of Sound
The papers presented here use two primary forms
of representation to deduce acoustic information, and to
demonstrate that information for readers. The two forms differ in
the point of view each takes to sound and the amount of detail
each supplies. Each can show two dimensions of sound
simultaneously: the spectrogram looks at time (horizontal axis)
by frequency (vertical axis) and the spectrum looks at frequency
(horizontal axis) by amplitude (vertical axis) as shown in the
schematic versions included in the Glossary listing for
"spectrogram" and "spectrum". Thus a
spectrogram shows for a variable duration of time the harmonic
composition of a sound; it cannot indicate amplitude in any
precise fashion, though degree of contrast in color is intended
precise fashion, though degree of contrast in color is intended
to indicate relative degrees of intensity.
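The relation between the two representations can be sketched
with a naive discrete Fourier transform: a spectrum is computed
from one short frame of the signal, and a spectrogram is a
sequence of such spectra from consecutive frames. This is a
minimal illustration, not the analysis software used for the
papers; the sample rate, the test signal, the absence of
windowing or overlap, and the function names are all assumptions.

```python
import cmath
import math

def magnitude_spectrum(frame, sample_rate):
    """Naive DFT of one frame: (frequency_hz, magnitude) pairs up to Nyquist."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        spectrum.append((k * sample_rate / n, abs(total) / n))
    return spectrum

def spectrogram(signal, sample_rate, frame_length):
    """Spectra of consecutive, non-overlapping frames: one per time step."""
    return [magnitude_spectrum(signal[start:start + frame_length], sample_rate)
            for start in range(0, len(signal) - frame_length + 1, frame_length)]

SAMPLE_RATE = 1600  # samples per second (assumed)

# Test tone: a 200 Hz fundamental and a weaker harmonic at 400 Hz.
signal = [math.sin(2 * math.pi * 200 * i / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 400 * i / SAMPLE_RATE)
          for i in range(320)]

# Spectrum: frequency by amplitude for a single frame.
spectrum = magnitude_spectrum(signal[:160], SAMPLE_RATE)
peaks = [f for f, m in spectrum if m > 0.1]
print(peaks)  # prints [200.0, 400.0]

# Spectrogram: time (frame index) by frequency.
frames = spectrogram(signal, SAMPLE_RATE, 160)
print(len(frames))  # prints 2
```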
Spectrograms can be narrowband or wideband. The
mechanical differences between these are not important for our
purposes here, but the visual difference is that the narrowband
spectrogram shows more vertical or frequential detail,
including individual harmonics, while the wideband spectrogram
shows more horizontal or temporal detail, so that while
durations are easily measured, individual harmonics are visually
merged into formant regions. Narrowband spectrograms thus
demonstrate more efficiently the movement of harmonics, while
wideband spectrograms demonstrate more efficiently the movement
and duration of formants. Spectra, on the other hand, show the
relative amplitudes of formant regions or individual harmonics at
a single instant in time. If a spectrum is accurate, it is
particularly useful in determining the harmonicity of partials,
as well as their high intensity regions; a spectrum, then, gives
a more precise representation of a sound's timbre at any
particular moment.
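The trade-off between frequential and temporal detail follows
from the length of the analysis window: a longer window separates
frequencies more finely but averages over more time. The sketch
below states that relationship under assumed values; the sample
rate, the two window lengths, and the function name are
illustrative and are not the settings of any particular analyzer.

```python
def analysis_resolution(window_length, sample_rate):
    """Return (frequency_resolution_hz, time_resolution_s) for one window."""
    return sample_rate / window_length, window_length / sample_rate

SAMPLE_RATE = 16000  # samples per second (assumed)

narrowband = analysis_resolution(1024, SAMPLE_RATE)  # long window
wideband = analysis_resolution(64, SAMPLE_RATE)      # short window

print(narrowband)  # prints (15.625, 0.064)
print(wideband)    # prints (250.0, 0.004)
```

For a voice with a 100 Hz fundamental, harmonics lie 100 Hz
apart: the long window (about 16 Hz per step) can show them
individually, while the short window (250 Hz per step) merges
them into bands but tracks changes every few milliseconds.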
It is important to remember, however, that often
the information provided by either form of presentation is an
approximation at best, limited by the resolution capabilities of
both the digitizer and the analyzer, as well as by the fineness
of detail possible in the graphic display of the software. It is
also important to be cautious in considering which details of the
visual representation of a sound sample are salient to the sound
as perceived; often the picture of a sound will include clearly
visible elements which are acoustically present in the sound but
too short in duration, or too soft in intensity to register
perceptually. A useful maxim in this regard is the following: If
a discrete element is filtered from a sound with no difference to
the resulting tonal sensation, then the element is unimportant to
the final percept and need not be considered in interpreting the
data, no matter how blatantly it appears in analysis.
Last updated
February 4, 2003 Address comments to savail@indiana.edu Copyright 2000-03,
The Trustees of Indiana University