Learning Objectives
By the end of this section, you should be able to
- 7.3.1 Describe how circuits in the auditory system determine the location of sound sources by comparing the inputs arriving at each ear.
- 7.3.2 Relate the physical characteristics of sounds to auditory perception.
- 7.3.3 Describe how speech perception changes in early childhood and the role of experience-dependent neural plasticity in this process.
It is remarkable to consider how the flapping of a mosquito’s wings can set in motion a process that involves thousands if not millions of neurons. We have traced the pathway this process takes from the outer ear to the cerebral cortex, exploring the exquisitely sensitive mechanisms for amplifying acoustic waves, breaking them down into component frequencies, and converting them to neural impulses. We have followed these impulses as they jump from neuron to neuron along the auditory pathway, crossing the midline one or more times. This process is not just the passive transmission of information, however: at each station in the pathway, signals are filtered, refined, and compared between ears. This section will explore the complicated relationship between acoustic properties and perception.
Location
One of the most important functions of the auditory system is to locate nearby animals. If the other animal is a predator, knowing its location makes escape possible; if it is prey, knowing its location makes capture possible. In behavioral tests, humans can reliably pinpoint the location of sound sources to within several degrees (Oldfield and Parker 1984). The acuity of sound localization is highest in front of the head along the horizontal axis. In our own experience, the reflexive turning of the head to look at an unexpected sound is so natural and automatic that it may be hard to appreciate how difficult the underlying calculations are. Yet evolution has solved this problem with incredible precision.
Interaural timing and level differences
The fact that animals have two ears is not simply a consequence of bilateral symmetry. Depending on the location of a sound source, the two ears will receive slightly different inputs. By comparing the inputs between the ears, the brain is able to determine the location of the source.
When we discuss auditory spatial cues, we are referring to location relative to the head (Figure 7.14). More specifically, what the brain needs to determine is the angle of the source relative to some reference direction, which by convention is usually taken to be the front of the head. The angle of the source has two components: the azimuth, which is the angle to the left or right on a horizontal plane, and the elevation, which is the angle up or down from that plane. Because the ears are separated from each other by the head along the plane of the azimuth, the azimuth of a sound source will affect the relative level and timing at which the sound reaches each ear.
Level differences between the ears arise because the head casts an acoustic shadow. Incoming sound waves tend to reflect off the head or to be absorbed by it, so the ear that is closer to the sound source will receive pressure waves with a higher amplitude. This effect is more pronounced for higher frequencies, because low-frequency sounds can more easily transmit through the head and diffract around it. The interaural level difference is greatest when the sound source is 90 degrees to the left or the right, and zero when it is directly in front of (0 degrees) or behind (180 degrees) the head.
Timing differences between the ears arise because the speed of sound is finite (343 m/s at sea level). At 90 degrees azimuth, the extra distance the sound must travel to reach the far ear in humans is around 0.2 m, corresponding to a delay of about 600 µs. By comparing the times at which sound arrives at the two ears, it is possible to determine its azimuth. For transient sounds, this comparison involves the difference between the times when the pressure wave’s onset arrives at either ear. This difference is called the interaural time difference. Interaural time differences are more relevant cues for the location of low-frequency sounds, whereas interaural level differences are more important at high frequencies.
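To make this arithmetic concrete, the short sketch below computes approximate interaural time differences from azimuth using the simple straight-path formula ITD ≈ (d/c)·sin(azimuth). This is a common first approximation rather than an exact model of sound traveling around the head; the ear separation and speed of sound are the values given above.

```python
import numpy as np

def interaural_time_difference(azimuth_deg, ear_distance_m=0.2, speed_of_sound=343.0):
    """Approximate ITD in seconds under the straight-path model:
    the extra distance to the far ear is ear_distance * sin(azimuth)."""
    return ear_distance_m * np.sin(np.radians(azimuth_deg)) / speed_of_sound

# A source straight ahead (0 degrees) produces no delay; a source at 90 degrees
# produces the maximum delay of roughly 0.2 m / 343 m/s, i.e. about 600 microseconds.
for azimuth in (0, 30, 60, 90):
    print(f"{azimuth:3d} degrees -> {interaural_time_difference(azimuth) * 1e6:6.0f} us")
```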
The elevation of a sound source does not produce interaural differences in level or timing. Nevertheless, humans and many other species can determine the elevation of sound sources thanks to the irregular shape of the auricles. The cavities and ridges of the auricle create resonances that selectively amplify some frequencies while damping others, creating “notches” in the sound spectrum. These resonances depend on both the elevation and azimuth of the sound source (Oldfield and Parker 1984). Thus, by determining where there are notches in the sound spectrum, it is possible for the brain to determine elevation.
The cues for distance are subtle, and much less is known about how this percept is formed (Kolarik et al., 2016). In the absence of obstacles, sound intensity attenuates with the square of the distance, so a source two meters away cannot be distinguished, on the basis of intensity alone, from a source twice as far away that is four times as intense. However, sounds coming from greater distances have the opportunity to interact with more objects, resulting in greater reverberation. It is likely that the perception of distance relies on such cues.
Neural circuits involved in decoding location
The superior olivary complex is the primary brain structure involved in determining azimuthal location. The medial superior olive compares timing between the two ears, whereas the lateral superior olive compares levels between the ears (Figure 7.15).
One of the earliest models for how the brain might calculate timing differences is the coincidence detector. As proposed by Jeffress (1948), if a neuron receives input from both ears, neither of which alone is enough to make it spike, then it will only fire when the two inputs are active coincidently (i.e., at the same time) (see Chapter 2 Neurophysiology). If the pathways from the ipsilateral and contralateral ear are the same length, then coincidence will only occur when the sound arrives at both ears simultaneously, from directly in front (0 degrees) or behind (180 degrees). If the ipsilateral pathway is slightly shorter, then the neuron will respond best when the input arrives at the contralateral ear slightly before the ipsilateral one. By systematically varying the relative lengths of the axons, an array of such coincidence detectors would be able to represent timing differences as a place code, much as different frequencies are represented in a tonotopic map. In birds, the medial superior olive appears to implement just such a detector (Figure 7.16). It receives inputs from the contralateral and ipsilateral cochlear nucleus, and the incoming axons, in many species, branch to form an orderly array, with the shortest paths from the ipsilateral side at one end and the longest paths at the other. In further support of this model, electrophysiological recordings from the medial superior olive reveal that neurons in this area respond preferentially to specific interaural time differences (Joris et al., 1998). The calculation of interaural time differences in mammals uses a similar principle, but is implemented differently (Lesica et al., 2010).
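A toy version of the Jeffress model can be written in a few lines, as sketched below. The spike trains, coincidence window, and set of internal delays are invented for illustration; the point is only that the detector whose internal (axonal) delay cancels the external interaural time difference receives the most coincident input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spike trains driven by a common sound. The sound reaches the contralateral
# ear 300 microseconds before the ipsilateral ear (external ITD of 300 us).
true_itd = 300e-6
contra_spikes = np.cumsum(rng.exponential(2e-3, size=200))
ipsi_spikes = contra_spikes + true_itd

def coincidences(train_a, train_b, window=50e-6):
    """Count spikes in train_a that fall within the coincidence window of a spike in train_b."""
    return sum(np.any(np.abs(train_b - t) < window) for t in train_a)

# An array of detectors, each adding a different internal (axonal) delay to the
# contralateral input. The detector whose delay cancels the external ITD fires
# the most, producing a place code for azimuth.
internal_delays = np.arange(0, 700e-6, 100e-6)
counts = [coincidences(contra_spikes + d, ipsi_spikes) for d in internal_delays]
best = internal_delays[int(np.argmax(counts))]
print(f"Best internal delay: {best * 1e6:.0f} us (true ITD: {true_itd * 1e6:.0f} us)")
```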
Differences in level between the two ears are computed by the lateral superior olive, which receives excitatory input from the ipsilateral ear and inhibitory input from the contralateral ear (via the medial nucleus of the trapezoid body). The firing rates of these inputs are proportional to the sound level in each ear. Neurons in the lateral superior olive add together the excitatory (positive) and inhibitory (negative) inputs, so their firing rate represents the difference in sound level between the ears.
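In schematic terms, each lateral superior olive neuron behaves like a rectified subtraction of the two input rates. The sketch below is purely illustrative; the gain and spontaneous rate are arbitrary numbers, not measured values.

```python
def lso_rate(ipsi_level_db, contra_level_db, gain=2.0, spontaneous_rate=5.0):
    """Schematic LSO unit: excitation grows with ipsilateral level, inhibition with
    contralateral level, and the output is the rectified difference (spikes/s)."""
    rate = spontaneous_rate + gain * (ipsi_level_db - contra_level_db)
    return max(rate, 0.0)

# A sound on the ipsilateral side drives the unit strongly; the same interaural
# level difference in the other direction silences it.
print(lso_rate(70, 60))   # 25.0
print(lso_rate(60, 70))   # 0.0
```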
Spatial information from the superior olivary complex propagates forward through the auditory pathway, and as it does, inputs from the medial and lateral superior olive combine to integrate information about timing and level differences. This results in a coherent, unitary perception of sound location.
Perceptual contents
In addition to determining the locations where sounds originate, the auditory system also has to determine what is making the sounds. These complex identifications are based on patterns of simpler spectral and temporal features of the sounds. In other words, we can describe complex sounds like “bird singing” or “lion roaring” in terms of simpler perceptual features. In this section, we will examine some of these perceptual features and how they relate to the physical properties of the sound, then describe some of the neural circuits involved in decoding these features.
Loudness, pitch, and timbre
In 7.1 Acoustic Cues and Signals, we learned that acoustic waves can be described in terms of three quantities: amplitude, frequency, and phase. Simple, sinusoidal sounds have a single value for each of these quantities. Complex sounds are combinations of simple sinusoids, which means that they can be described as a spectrum of frequencies, each with its own amplitude and phase. The basilar membrane decomposes complex sounds into their constituent frequencies, but how are these complex patterns of activation perceived?
Acoustic amplitude (or intensity) is perceived as loudness. Sounds with higher amplitudes sound louder. The relationship between physical amplitude and perceived loudness is geometric (also called logarithmic), which means that proportional increases in amplitude are perceived as linear increments in loudness. Because intensity is usually reported using the logarithmic decibel (dB) scale (see Table 7.1), a linear step in decibels of intensity corresponds to a linear step in perceived loudness. In other words, the perceived difference in loudness between a 65 dB tone and a 75 dB tone is the same as the perceived difference between a 70 dB tone and an 80 dB tone. This relationship only holds if the cochlea is healthy and undamaged, and if the intensity is within the range of human hearing. Below 0 dB, sounds are inaudible, and above 120 dB, the perception of loudness is distorted by saturation (i.e., overstimulation) of the physical and neural processes of transduction.
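Because the decibel scale is defined as 20·log10 of the ratio of pressure amplitudes, equal multiplications of amplitude correspond to equal additive steps in decibels. The brief sketch below illustrates this; it is a worked example of the scale itself, not a model of loudness perception.

```python
import numpy as np

def amplitude_to_db(amplitude, reference=1.0):
    """Convert a pressure amplitude to decibels relative to a reference amplitude."""
    return 20 * np.log10(amplitude / reference)

# Each doubling of amplitude adds the same ~6 dB step; on the account above,
# equal dB steps are in turn heard as roughly equal steps in loudness.
for factor in (1, 2, 4, 8):
    print(f"amplitude x{factor}: {amplitude_to_db(factor):5.1f} dB")
```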
Frequency is perceived as pitch. Similarly to loudness, this is a geometric relationship. A doubling of frequency corresponds to a perceived pitch interval of one octave, regardless of where you are on the musical scale.
Periodic sounds usually contain multiple frequencies; it is very rare to hear a pure sinusoidal tone. When physical objects vibrate, they often produce a harmonic series, consisting of frequencies that are integer multiples of the lowest frequency. Such sounds are perceived as having a pitch corresponding to the lowest frequency in the series, the fundamental frequency. The numerical relationship between harmonics influences the perception of tonality, or how different pitches sound in relation to each other. For example, tones that are an octave apart sound “the same” because the harmonics of the higher note overlap completely with the harmonics of the lower note.
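The octave relationship can be made concrete with a short sketch that lists the harmonic series of a note and of the note one octave above it; every harmonic of the higher note coincides with an even harmonic of the lower one. The frequencies used (220 Hz and 440 Hz) are simply convenient musical examples.

```python
def harmonic_series(fundamental_hz, n_harmonics=8):
    """Return the first n integer multiples of the fundamental frequency."""
    return [fundamental_hz * k for k in range(1, n_harmonics + 1)]

low = harmonic_series(220)    # 220, 440, 660, 880, ...
high = harmonic_series(440)   # one octave up: 440, 880, 1320, ...

# The shared harmonics are exactly the even harmonics of the lower note, which is
# part of why tones an octave apart sound "the same".
print(sorted(set(low) & set(high)))   # [440, 880, 1320, 1760]
```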
The higher harmonics (also called overtones) are not heard as distinct tones, but instead influence perception of timbre. Timbre is a qualitative rather than a quantitative percept (McAdams 2019). It may be difficult to describe how a clarinet sounds in words, but it is easy to distinguish it from a violin or a French horn, even when they are playing notes of the same pitch. This is because, as illustrated in Figure 7.18, the relative distribution of amplitude in the harmonics a clarinet produces is different from the distribution of amplitude in the same harmonics produced by a violin.
Surprisingly, the perception of pitch does not require hearing the fundamental frequency. Even when the fundamental frequency is filtered out from a note, listeners perceive the sound as having the same pitch (Plack and Oxenham 2005). This observation clearly shows that the perception of pitch is constructed by the auditory system and that it requires neural circuits to integrate information across multiple frequencies.
Neuroscience in the Lab
Psychoacoustic experiments
The study of how physical properties of acoustic stimuli relate to perception is called psychoacoustics. It is a branch of a broader field called psychophysics. Although psychoacoustics does not directly examine the brain, it has yielded many important insights into how the brain processes sound purely by examining behavior.
Psychoacoustic experiments require precise control over how stimuli are presented. They usually take place in specially constructed acoustic isolation chambers that strongly attenuate extraneous sounds. Stimuli are generated by computer programs that can synthesize sounds or manipulate digital recordings. Computers are also used to control presentation of different stimuli and to record the subject’s responses, which usually involve pressing keys rather than making qualitative descriptions. Human subjects can be instructed about how to respond, but nonhuman animals can also be used in psychophysics using operant conditioning or other forms of behavioral training (see Chapter 18 Learning and Memory).
Figure 7.17 illustrates two common psychoacoustic paradigms.
In a threshold experiment, subjects report whether they can hear a stimulus or hear the difference between two stimuli. A threshold plot shows the minimum value or difference required to elicit a response. In a discrimination experiment, subjects report which of two (or more) categories a stimulus belongs to. For example, a subject might be asked to indicate whether the second tone in a series is higher than the first. This type of experiment produces a psychometric curve showing the proportion of trials on which subjects gave a specific response. Psychometric curves typically have a sigmoidal shape. The midpoint of the curve indicates the “psychological midpoint” at which the subject cannot discriminate which category the stimulus belongs to, and the slope indicates how sharply the subject discriminates between the categories.
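Psychometric curves are often summarized by fitting a logistic (sigmoid) function whose midpoint and slope are the two quantities described above. A minimal sketch of such a fit is shown below; the response proportions are invented for illustration, and scipy's curve_fit is just one of several reasonable fitting tools.

```python
import numpy as np
from scipy.optimize import curve_fit

def psychometric(x, midpoint, slope):
    """Logistic psychometric function: probability of responding 'second tone higher'."""
    return 1.0 / (1.0 + np.exp(-slope * (x - midpoint)))

# Invented example data: frequency difference (Hz) between two tones and the
# proportion of trials on which the subject reported the second tone as higher.
delta_f = np.array([-20.0, -10.0, -5.0, 0.0, 5.0, 10.0, 20.0])
p_higher = np.array([0.02, 0.10, 0.30, 0.50, 0.72, 0.90, 0.98])

(midpoint, slope), _ = curve_fit(psychometric, delta_f, p_higher, p0=(0.0, 0.2))
print(f"Fitted midpoint: {midpoint:.1f} Hz, slope: {slope:.2f}")
```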
Neural circuits decoding contents
Much less is known about how the brain decodes the perceptual contents of auditory stimuli. Natural sounds often comprise multiple frequencies and extend over time, so the neural circuits that decode contents and identify sources need to integrate over multiple frequencies and over time. Even the perception of loudness, though it primarily depends on amplitude, is not entirely straightforward. Amplitude is encoded at the earliest stages of audition. Amplitude affects the displacement of the basilar membrane and hair cell stereocilia, and the firing rate of auditory nerve fibers is proportional to amplitude. This representation of amplitude is maintained throughout the auditory pathway, with most neurons firing more rapidly to more intense stimuli. However, the perception of loudness also depends on duration. Shorter bursts of white noise sound quieter than longer bursts even when they are the same amplitude (Scharf 1978). This implies that the auditory system is keeping a memory of intensity over short intervals of time through temporal summation (or integration).
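One simple way to picture temporal summation is a leaky integrator of stimulus intensity: a longer burst accumulates a larger integrated value even when its amplitude is unchanged. The sketch below is a schematic model with an assumed time constant, not a quantitative account of loudness.

```python
import numpy as np

def integrated_intensity(intensity, dt=0.001, tau=0.1):
    """Leaky integration of an intensity envelope with time constant tau (seconds)."""
    out = 0.0
    for x in intensity:
        out += dt * (x - out) / tau
    return out

# Two noise bursts with the same amplitude but different durations: the longer
# burst drives the integrator to a higher value, paralleling its greater loudness.
short_burst = np.ones(20)    # 20 ms at unit intensity (1 ms samples)
long_burst = np.ones(200)    # 200 ms at the same intensity
print(integrated_intensity(short_burst), integrated_intensity(long_burst))
```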
Conversely, the perception of pitch and timbre requires integration across multiple frequencies. Auditory nerve fibers are tuned to single frequencies corresponding to the region of the basilar membrane they innervate. Many neurons throughout the ascending auditory pathway share this selectivity, but increasingly larger proportions of neurons respond strongly to more than one frequency. These “multi-peak” neurons are prevalent in the auditory cortex (Winter 2005); interestingly, many of these neurons are tuned to frequencies that are integer multiples of a fundamental frequency, just like the harmonics in a complex periodic sound (see Figure 7.18). This suggests that these neurons may be involved in the perception of pitch (Wang 2013).
The auditory cortex is also likely to be the site where short-term and long-term memories of specific sounds and acoustic patterns are stored. Electroencephalography and neuroimaging studies in humans consistently show that responses to complex sounds or to deviations from expectations (a form of memory) are centered in the auditory cortex (Demany and Semal 2008) (see Methods: Electroencephalography). In nonhuman animals, training to recognize or discriminate complex sounds results in neural plasticity within the auditory cortex, implying that the memories associated with the animals’ newly acquired perceptual abilities are formed within the cortex (Bao et al., 2013).
Speech and other communication signals
Acoustic communication signals are often complex, consisting of periodic and aperiodic sounds. The frequency and amplitude of these components are often dynamic, changing over time. To illustrate, Figure 7.21 shows a spectrogram of a song sparrow singing (top) and the author speaking a phrase (middle). Spectrograms are two-dimensional plots of how the spectrum of a complex sound changes over time: frequency is shown on the y-axis, time on the x-axis, and relative amplitude by the intensity or color of the plot. The sparrow song consists of a series of different kinds of tonal (periodic) sounds in rapid succession. Some of the tonal sounds have clear harmonics, indicated by parallel lines at integer multiples of the lowest frequency. In other sounds, there is only a single frequency, but it is rapidly modulated up or down to produce chirps and trills. The human speech consists of long tonal components, corresponding to vowels; transients spanning the whole range of frequencies, corresponding to consonants like “t”, “b”, and “k”, where the flow of air is stopped by the tongue or lips; and broadband aperiodic noise, corresponding to consonants like “s”, where turbulent air passes through the teeth.
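Spectrograms like those in Figure 7.21 are computed with the short-time Fourier transform, which measures the frequency content of successive brief windows of the waveform. The sketch below shows one common way to do this in Python with scipy and matplotlib; the file name is a placeholder for any mono recording, and the window parameters are typical but arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

# "recording.wav" is a placeholder for any single-channel recording.
rate, samples = wavfile.read("recording.wav")

# Short-time Fourier transform: 512-sample windows with 75% overlap.
freqs, times, power = spectrogram(samples.astype(float), fs=rate, nperseg=512, noverlap=384)

plt.pcolormesh(times, freqs, 10 * np.log10(power + 1e-12), shading="auto")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram (relative amplitude, dB)")
plt.show()
```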
In this section, we will focus on human speech, one of the most important stimuli our auditory system has to process.
The acoustic and phonetic structure of speech
Human speech consists of a series of periodic and aperiodic sounds. Speech sounds are produced by the movement of air through the vocal tract. The parts of the vocal tract that are moved to shape the air stream and produce speech sounds are called articulators.
The main articulator for speech is the larynx, and more specifically, the vocal folds, bands of muscle tissue in the center of the airway. In normal breathing, the vocal folds are relaxed, and air is able to pass through freely. In speaking or singing, the folds are tightened and the air pressure from the lungs increases, causing them to vibrate open and closed. This vibration, also called phonation, produces a periodic wave that forms the basis for vowels.
In most languages, vowels are distinguished not by the fundamental frequency of the vocal fold vibrations, but by how this sound, which is rich in harmonics, is filtered in the upper vocal tract. Positioning the tongue and lips creates cavities of different sizes and shapes. The resonances of these cavities act as filters, producing peaks in the vocal spectrum called formants.
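This source-filter arrangement can be sketched in code: a harmonic-rich pulse train stands in for the vibrating vocal folds, and resonant filters at the formant frequencies stand in for the vocal-tract cavities. The formant values below are rough, illustrative figures for an “ah”-like vowel rather than measurements.

```python
import numpy as np
from scipy.signal import iirpeak, lfilter

fs = 16000          # sample rate (Hz)
f0 = 120            # fundamental frequency of vocal-fold vibration (Hz)

# Source: a pulse train rich in harmonics, standing in for glottal pulses.
source = np.zeros(fs // 2)          # half a second of signal
source[::fs // f0] = 1.0

# Filter: parallel resonators at rough formant frequencies for an "ah"-like vowel.
# Their summed output emphasizes energy near the formants, as the vocal tract does.
formants_hz = (700, 1200, 2600)
vowel = sum(lfilter(*iirpeak(f, Q=8, fs=fs), source) for f in formants_hz)

# Changing formants_hz reshapes the spectrum (and thus the vowel) without changing f0.
```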
Consonants are phonemes that occur at the beginnings and ends of vowels. They can be produced by starting and stopping air flow through the vocal tract, or by restricting its flow through a narrow opening, producing turbulence.
As you might imagine, the human vocal system can produce a great variety of different vowels and consonants. Not every speech sound, or phone, is used in every language, however, and different phones may be used interchangeably. A group of phones that can be used interchangeably is called a phoneme. In English, the phones [r] and [l] are distinguished from each other: switching [r] for [l] causes the meaning of a word to change (for example “right” and “light”). The same is not true in Japanese: there is only a single phoneme that includes the whole range of sounds between [r] and [l].
Phonetic perception is categorical: acoustically dissimilar stimuli can be perceived as identical. This phenomenon was demonstrated by Alvin Liberman and Ignatius Mattingly, two pioneers of speech perception, using synthetically generated speech (Liberman et al., 1957; Liberman and Mattingly 1985). As illustrated in Figure 7.19, when the shape of the onset of a consonant-vowel syllable is systematically shifted, English-speaking listeners report hearing either “ba”, “da”, or “ga”. Even though the acoustic difference in each step is the same, this linear acoustic relationship is not reflected in what listeners perceive. Rather, listeners hear sudden transitions from one phoneme to another (there is no percept midway between “ba” and “da”), and they are unable to distinguish sounds within the same category from each other.
Categorical perception results in a many-to-one mapping between the acoustic structure of speech and phonetic perception. At some level of the auditory pathway, the brain must be responding the same way to different stimuli (Holt and Lotto 2010). This is a critical feature of vocal communication, because it allows us to understand the same words and phonemes produced by different individuals despite wide variations in pronunciation. It also allows us to understand the same phoneme produced by the same individual in different contexts, at different frequencies, and with different emphases.
Developmental Perspective: Normal development of phonetic perception
Different languages have different sets of phonemes, and categorical perception tends to follow the phonetic structure of the listener’s native language. In other words, individuals who grow up hearing English may not be able to hear the difference between sounds that form two distinct phonemes in another language. This observation suggests that phonetic perception is shaped by experience. Moreover, the older people get, the more difficult it is for them to learn to perceive the phonemes of a new language. They hear the new language through the lens, so to speak, of their native language, which makes it more difficult to rapidly process speech or correct an accent. These observations imply that the effect of experience on perception is limited to a narrow window during development, i.e. a critical period (see Chapter 5 Neurodevelopment).
Work by Patricia Kuhl and Janet Werker has revealed how phonetic perception is established early in life (Werker and Tees 1984; Kuhl et al., 2006). Prior to about 6 months of age, infants from all over the world, raised in environments where different languages are spoken, can discriminate between all of the phonemes found in all languages. However, by 9–12 months of age, infants become better at discriminating phonemes in the language of their parents or caregivers and worse at discriminating phonetic contrasts not found in that language (Figure 7.20). Children raised hearing Japanese lose the ability to distinguish [r] from [l]. In contrast, children raised in English-speaking environments stop being able to tell apart “dental” [t], made by placing the tongue against the teeth, from “retroflex” [T], formed by placing the tongue against the roof of the mouth; these sounds are distinct phonemes in Hindi but not in English. This shift, which commits the infant to a certain way of hearing speech, is echoed by changes in how the brain responds to native and nonnative phonemes (Bosseler et al., 2013). This commitment is necessary for normal language development (Tsao et al., 2004), but it makes it increasingly difficult for children to hear other languages as native speakers do. This could explain why children who begin learning a language later in life are more likely to speak with an accent (Piske et al., 2001), because they would have a harder time hearing their own mispronunciations and correcting their errors.
Neuroscience in the Lab
Measuring perception in infants and nonverbal animals
How is it possible to measure what an infant is hearing before it can speak or even understand questions? Kuhl, Werker, and their colleagues devised a clever technique that takes advantage of basic mechanisms of perception and learning (Werker and Tees 1984; Kuhl et al., 2006). As illustrated in Figure 7.20, infants sit on their mother’s lap while viewing a toy. As the infant watches the toy, experimenters play a series of consonant-vowel pairs like “ka”. At random intervals, the “ka” sound is replaced by a different sound like “ba”, and a few seconds later, a more exciting toy is revealed in a different direction. The infant’s gaze is drawn to the new visual stimulus, a natural orienting response that can be observed almost immediately after birth. Within several such trials, the infant learns that the change in sound predicts the exciting toy, and will turn its head to look at the toy even before it appears. The proportion of trials when the infant shifts its gaze is therefore a measure of how perceptually dissimilar the “oddball” is from the standard. Using this approach, Werker and Kuhl were able to show that native phonemes become more easily discriminated at 9–12 months of age, while nonnative phonemes become harder to tell apart.
Neuroscience Across Species: Experience-dependent plasticity in animal models
What happens as a child’s brain commits itself to a particular set of phonemes? Although it is not possible to study this at a cellular or molecular level in humans, work in nonhuman animal models has shed light on how experience could be reshaping how the auditory system processes sounds.
The auditory cortex in rats, as in other species, is organized by frequency in a tonotopic map (Figure 7.22). This organization is seen throughout the classical auditory pathway, so a reasonable hypothesis would be that the cortical organization is inherited via topographic connections from earlier stations. This does not mean that the map is fixed, however: work from Michael Merzenich and his colleagues has shown that the map is plastic early in life, subject to alteration by experience (Zhang et al., 2001). When a rat is raised in an environment dominated by tones of a single frequency, the resulting map is dominated by that frequency. In other words, repeated exposure to 4 kHz tones results in more neurons, over a larger area of cortex, responding preferentially to that frequency. Similar effects are seen with other features like rhythm (Zhou et al., 2008). However, the map is only plastic for a limited period during development; the same treatment has little effect on older animals. This closely resembles what is seen in humans: once a child’s brain has committed to a specific language’s phonetic map, it becomes much more difficult to change (Kuhl et al., 2006; Bosseler et al., 2013).