Psychoacoustics self-education links sharing

youngho · Jun 15, 2023

Hi,

Having read Sound Reproduction but wanting to learn more, I skimmed the following links/resources for self-education and wanted to share them with others. I've also attached my notes (apologies for any inaccuracies or misrepresentations!) and selected quotes. The most important lessons for me were @j_j 's explanation of a physiological basis for the Haas or precedence effect, the concept of multiple overlapping ERBs that might explain pitch resolution far finer than 1/3 octave, waveform vs envelope for ITD, and frequency-specific localization cues.

Hartmann 1999
https://www.cogsci.msu.edu/DSS/2019-2020/Hartmann/Hartmann_1999.pdf
How we localize sound (pre-millennium survey of the field for laymen)
Diffraction/scattering around human head, significant shadow by 4 kHz. Secondary scatterers aid localization at higher frequencies.
ILD small <500 Hz if source >1m away, smallest detectable change ~0.5 dB, independent of frequency, cues large and reliable >3 kHz
Smallest detectable change in ITD ~1-2 degrees azimuth near forward direction, ITD sensitivity decreases above 1-1.5 kHz
ATF/HRTF boosts ~1 kHz for waves from behind, near 3 kHz from front. Above 4 kHz, outer ears and pinnae scatter significantly, quite individual-specific above 6 kHz, mostly with a valley-and-peak structure that shifts with frequency, peaking near 7 kHz with source directly overhead. Thus, narrow-band sources may seem to localize depending on frequency, not source location
ILD dominates at high frequencies, ITD at low
ITD vulnerable to room reflections and reverb, since depends on binaural cross-correlation, no useful coherent sound in reverberated sound, so ITD unreliable in large room where reflected sound dominates
Reflecting surfaces tend to absorb more with increasing frequency, so reflected power relatively smaller. Since ILD frequency-independent, listeners tend to use highest freq info available
Precedence effect based on earliest arriving waves of sound onset, "often thought of as a neural gate that is opened by the onset of a sound, accumulates localization information for about 1 ms, and then closes to shut off subsequent localization cues," another model is "strong reweighting of localization cues in favor of the earliest sound"
"Spectral differences caused by anatomical filtering for sources in the midsagittal plane" also seem to affect precedence effect

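The note above that ITD "depends on binaural cross-correlation" can be made concrete with a toy estimator. This is only a minimal sketch, not Hartmann's method: the function names, the 48 kHz sample rate, and the use of independent noise in each ear as a crude stand-in for reverberant sound are all assumptions made for illustration.

```python
import math
import random


def estimate_itd(left, right, sample_rate, max_lag):
    """Return the lag (in seconds) that maximises the left/right cross-correlation.

    A positive result means the right-ear signal lags the left-ear signal.
    """
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, sample in enumerate(left):
            m = n + lag
            if 0 <= m < len(right):
                score += sample * right[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


def make_ear_signals(delay_samples, reverb_gain, n=2000, seed=0):
    """Hypothetical test signal: one noise source, delayed at the right ear.

    reverb_gain scales independent noise added to each ear (an assumed,
    very rough model of incoherent reverberant energy).
    """
    rng = random.Random(seed)
    source = [rng.gauss(0.0, 1.0) for _ in range(n)]
    delayed = [0.0] * delay_samples + source[: n - delay_samples]
    left = [s + reverb_gain * rng.gauss(0.0, 1.0) for s in source]
    right = [s + reverb_gain * rng.gauss(0.0, 1.0) for s in delayed]
    return left, right


SAMPLE_RATE = 48_000  # Hz, assumed
left, right = make_ear_signals(delay_samples=24, reverb_gain=0.0)
print(estimate_itd(left, right, SAMPLE_RATE, max_lag=40))  # 24 / 48000 = 0.0005 s
```

Raising `reverb_gain` lowers the correlation peak relative to the background, which is one way to picture why ITD becomes less reliable when reflected sound dominates.
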
Hartmann 2005
https://web.pa.msu.edu/acoustics/koller.pdf
Binaural coherence in rooms
"The ITD arises physically because a sound wave coming from a direction to the listener's right takes longer to reach the left ear than to reach the right ear–hence, an interaural time difference. The interaural time difference may be as large as 800 microseconds for a source at 90 degrees azimuth with respect to the listener’s forward direction, but the human binaural system is capable of localizing a sound based on an ITD of only 10 microseconds, leading to a sensitivity of about 1 degree of arc."
"Although the waveform phases are similar in the two ears, the ITD, defined as a phase delay, is equal to the phase difference divided by the frequency. Therefore, there may still be a significant ITD because the frequency is low. The ITD may contradict the ILD and be discounted."
"The long-wavelength region extends out to about 500Hz, where the coherence shows a local minimum...Within the long-wavelength region below 500 Hz the span of ITDs measured in our environments was often large. The spans were found to correlate with the coherence. although the low-frequency peaks may have coherence that is high enough to make them influential, these peaks are not necessarily useful."
"A survey of the binaural literature shows that experimenters have often focused on the 500 Hz region for listening experiments. That is because the binaural system, especially the ITD-sensitive part, works best in that region [20]. The special role of 500 Hz in binaural neural processing may not be entirely fortuitous. Because of the size of the human head, this frequency range corresponds to a minimum in coherence when the soundfield is isotropic. If a listener is required to localize a source in the presence of an interfering reverberant field that is approximately isotropic, then any peak that occurs in this frequency region is likely to come from the direct sound from the source and not from the environment. Consequently it is to the listener’s advantage to pay special attention to the 500 Hz frequency region."
"As the frequency of a sound increases beyond 1000 Hz, there is a substantial degradation in the ability of the binaural system to make use of ITD in the waveform. Timing in the fine structure of a tone or noise ceases to be of value. Instead, listeners are able to make use of ITD in the envelope of sounds. If there is no structure in the envelope, as for a continuous sine tone, then listeners cannot localize. For noise, like the third-octave noises used in our experiments, the ITD in the temporal fluctuations can be used. Given the significance of envelope ITDs at mid and high frequencies, it would seem that the waveform cross correlation and waveform coherence, as measured in the experiments reported here, are less interesting than the cross-correlation and coherence of the envelope. However, the coherence of the waveform and of the envelope are statistically related."

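The figures quoted above (up to roughly 800 microseconds at 90 degrees, 10 microseconds for about 1 degree, and ITD as phase difference divided by frequency) can be checked with a little arithmetic. This sketch assumes the low-frequency rigid-sphere approximation ITD = (3a/c)·sin(azimuth), a head radius of 8.75 cm and a sound speed of 343 m/s; none of these values come from the quotes, and the phase delay is computed using angular frequency (2πf).

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed
HEAD_RADIUS = 0.0875    # m, assumed


def itd_low_freq(azimuth_deg, radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """ITD in seconds from the assumed low-frequency sphere model."""
    return 3.0 * radius / c * math.sin(math.radians(azimuth_deg))


def azimuth_for_itd(itd_s, radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Azimuth in degrees that would produce the given ITD in the same model."""
    return math.degrees(math.asin(itd_s * c / (3.0 * radius)))


def itd_from_phase(phase_diff_rad, freq_hz):
    """Phase delay in seconds: phase difference divided by angular frequency."""
    return phase_diff_rad / (2.0 * math.pi * freq_hz)


print(f"ITD at 90 degrees: {itd_low_freq(90.0) * 1e6:.0f} us")
print(f"Azimuth for a 10 us ITD: {azimuth_for_itd(10e-6):.2f} degrees")
# The same small phase difference is a much larger time difference at low frequency.
print(f"0.1 rad at 100 Hz:  {itd_from_phase(0.1, 100.0) * 1e6:.0f} us")
print(f"0.1 rad at 1000 Hz: {itd_from_phase(0.1, 1000.0) * 1e6:.0f} us")
```

Under these assumptions the model lands a little under the quoted 800 microseconds, and a 10 microsecond ITD corresponds to a bit less than 1 degree.
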
Johnston 2013
https://www.aes-media.org/sections/pnw/ppt/jj/heyser.pptx
Heyser lecture
“What you like to listen to is PREFERENCE, not “accuracy”. You listen to what you prefer to hear, not what is measurably more accurate, unless of course, you prefer a good measurement. Preference is inviolate!”
“The SNR experiences teach the artistic side to ignore the engineer. The lack of DBT’s and testability teach the engineers to ignore the artist.”
“SNR is mostly harmless”

Yost 2015
https://acousticstoday.org/wp-conte...ychoacoustics-A-Brief-Historical-Overview.pdf
“Helmholtz used Fourier’s theorems to describe a resonance theory of frequency analysis performed by the inner ear as the basis of pitch and argued that the resonance place with the greatest magnitude would be a determining factor in pitch perception. Because his inner ear resonators were more sharply tuned at low frequencies, low frequencies were likely to be a dominant factor in pitch perception.”
“Schouten (1940) formulated his “residue theory,” which suggested that the missing fundamental pitch was based on the temporal amplitude envelope of the missing fundamental stimulus that would exist in a high-frequency region after the sound was transformed by inner ear filtering processes.”
“Rayleigh [1907] argued that the interaural level difference (ILD) was a possible cue at high frequencies where the ILDs would be large due to the head shadow, and an interaural time (phase) difference could be a cue at low frequencies.”
“As director of acoustical research (see Allen, 1996), Fletcher oversaw a litany of psychoacoustic research achievements unmatched in the history of the field, which included measurements of auditory thresholds (leading to the modern-day audiogram, the gold standard for evaluating hearing loss), intensity discrimination, frequency discrimination, tone-on-tone masking, tone-in-noise masking, the critical band, the phon scale of loudness, and the articulation index.”
“Fletcher originally conceived of critical bands in terms of both loudness and masking (Allen, 1996). The critical band is a frequency region that is “critical” for masking and/or loudness summation (the masking definition is used most often). The threshold for detecting a tonal signal masked by a noise is proportional to the power in a critical band of masker frequencies surrounding the signal frequency. The critical band is modeled as a bandpass filter similar to the action of the biomechanical properties of cochlear processing (Moore, 1989; Table 1). Eberhardt Zwicker (Table 1) in Germany using primarily loudness data developed similar critical-band measurements. The bandwidths of Zwicker’s critical bands are referred to as the Bark Scale (Zwicker and Fastl, 1991). The gammatone filter bank is a current manifestation of critical-band filters (Patterson et al., 1995).”
“Licklider (1956) also developed a “triplex” theory of hearing in which he proposed an autocorrelator for pitch processing. Pitch perception was studied extensively in the Netherlands and Germany by psychoacousticians such as Reinier Plomp, Burt de Boer, Frans Bilsen, and Ernst Terhardt. The autocorrelation approach of Licklider (and its later variations, Meddis and Hewitt, 1991) can extract temporal regularity from a sound as a basis of pitch. However, there are equally successful models of pitch that are based on the spectral structure of a sound. Julius Goldstein, Ernst Terhardt, and Fred Wightman each developed successful spectrally based models of pitch perception. As mentioned previously, the debate about spectral versus temporal accounts of pitch perception continues today (Yost, 2009).”

Risoud 2018
https://www.sciencedirect.com/science/article/pii/S187972961830067X
“Sound source localization consists in determining the position of the source of a sound in 3 dimensions comprising 2 angles and 1 distance: azimuth (azimuth angle) in the horizontal (or azimuthal) plane: 0 ± 180°; elevation (height or vertical angle) in the vertical plane: 0 ± 90°; and distance in depth: 0 ± ∞. Three main physical parameters are used by the auditory system to locate a sound source: time, level (intensity) and spectral shape. Horizontally, the azimuth is mainly determined by binaural factors, involving both ears: i.e., interaural time and level differentials. Vertically, height is determined monaurally, involving just one ear: i.e., changes in incident spectral shape (reflection, diffraction and absorption) brought about by the pinna, head, shoulders and bust, known as the head-related transfer functions (HRTF). Depth distance is mainly determined monaurally.”
“The auditory system assesses ITD by low-frequency phase-shift for wavelengths exceeding head diameter, high-frequency envelope shift for wavelengths shorter than head diameter.”
“ITD is fundamental in locating sound sources at frequencies below 1500 Hz [12], but becomes ambiguous at higher frequencies [13]…Thus, for pure tones higher than 1500 Hz, phase-shift ceases to be relevant, as several wavelengths may have followed one another between one ear and the other (Fig. 3). In the case of complex sounds, on the other hand, ITD remains relevant beyond 1500 Hz thanks to the perceived difference in arrival time of the sound envelope between the two ears, sometimes known as the interaural envelope difference. However, envelope cues play little if any role in determining azimuthal localization in a free field.”
“The head masks sounds: this shadow effect reduces intensity, especially at higher frequencies [10]. For wavelengths shorter than head diameter, the head partially decreases acoustic energy by reflection and absorption. Thus, the lowest frequency at which the shadow effect occurs is approximately: f_min = v / λ_max = 343 / 0.175 ≈ 1960 Hz…ILD is thus virtually zero below 1500 Hz, and becomes relevant for wavelengths shorter than head diameter (> 1500 Hz)”
“ITD and ILD provide precise localization in the azimuthal plane, with the exception of what is known as the “cone of confusion”[18]. For sounds coming from the circumference of this cone, the axis of which is the interauricular line, there are no time or level differences, leading to confusing perceptual coordinates: the subject is unable to tell whether the sound is coming from in front or from behind, above or below, or from anywhere else along the circumference (Fig. 4). For any sound source with coordinates Δ, α, θ, there is a mirror-image position (Δ, 180 – α, −θ) with similar ITD and ILD. Dynamic, spectral and visual perceptual disambiguation strategies therefore developed.”
“Sound stimulus frequency greatly affects the accuracy of localization [12,22,23], which is best for low frequencies (< 1000 Hz), poorest between 1000 and 3000 Hz, and intermediate for high frequencies (> 3000 Hz)…The accuracy of sound source localization thus depends on: azimuthal position (better in front than to the side); type of stimulus (band width: the wider the band, the better the accuracy; frequency: poorer between 1000 and 3000 Hz; and speech or tonal type of sound).”
“Human subjects were shown to be able to determine monaurally the vertical localization of high but not low-frequency sounds, probably due to the small size of the pinna, which allows it to interact only with short-wavelength sounds [38]. Sounds can be accurately located vertically only if they are complex; they include > 7000 Hz components; the hearer’s pinna is present [39].”
“Determining the distance of a sound source mainly depends on monaural cues, and is much easier for familiar sounds [41]. Generally speaking, close distances tend to be overestimated and long distances underestimated…The direct-to-reverberant energy ratio [19] is the first distance cue….Initial time delay gap (ITDG) is the time gap between the arrival of the direct sound wave and the first strong reflection…Level is also a distance cue, distant sources giving rise to lower perceived level... Spectrum is another distance cue, high frequencies being more quickly muffled by the air: air absorption coefficient is higher the higher the sound frequency”
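
Two pieces of the Risoud quotes lend themselves to a quick check: the head-shadow onset frequency (speed of sound divided by head diameter) and the mirror-image position on the cone of confusion. The sketch below reuses the 343 m/s and 0.175 m figures from the quote; the function names are made up for illustration, and the arguments of the mirror function simply follow the order of the quoted triple (Δ, α, θ) without assuming more about the paper's symbol definitions.

```python
SPEED_OF_SOUND = 343.0  # m/s, as in the quoted calculation
HEAD_DIAMETER = 0.175   # m, as in the quoted calculation


def shadow_onset_frequency(diameter=HEAD_DIAMETER, c=SPEED_OF_SOUND):
    """Frequency (Hz) whose wavelength equals the head diameter."""
    return c / diameter


def cone_of_confusion_mirror(delta, alpha_deg, theta_deg):
    """Mirror-image position with similar ITD and ILD, per the quoted mapping
    (delta, alpha, theta) -> (delta, 180 - alpha, -theta)."""
    return (delta, 180.0 - alpha_deg, -theta_deg)


print(f"Shadow onset: about {shadow_onset_frequency():.0f} Hz")
print(cone_of_confusion_mirror(2.0, 30.0, 10.0))
```

Applying the mirror mapping twice returns the original position, which is a handy sanity check that the mapping pairs positions up rather than chaining them.
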

Johnston 2019
https://www.aes-media.org/sections/pnw/ppt/jj/aes_apr2019_hearing099.pptx
Hearing 096 (precursor to hearing 101)

External acoustics: Sounds interacting with head “filtered” by head-related transfer functions, depends on head shape and size, pinna shape and location; also interaural time difference, which can be manipulated independently
Middle ear consists of three bones and ear drum, acts mostly as impedance transformer, includes first-order high-pass filter operating at about 700 Hz (accounts for much of lower frequency threshold increase), acoustic impedance rises at high frequencies (hence much of high frequency threshold elevation), canal provides boost at one or two resonant points
“Inner ear”: oval window connects to stapes so conducts sound by vibrations, round window on other side of organ of Corti lets energy in endolymph out, basilar membrane helps separate the scala tympani (space connected to the ear drum) from the scala vestibuli (space connected to the air behind the ear drum). Inner hair cells do almost all of detection, outer hair cells seem to provide compression by changing stiffness and therefore membrane tunings. Setup acts as nonlinear filter bank, high frequencies at input, low frequencies at far end. ~2500 inner hair cell groups with overlapping filter characteristics, bandwidth estimated as ERBs, conventionally simplified from 2500 down to about 90. Fast phase shift provides information at low frequencies, as inner hair cells fire synchronously on the “moving together” direction.
CNS: sound entering the ear, then loudness “integration,” feature analysis, and auditory object analysis, but last two affected by cognitive and other feedback like expectation
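
To get a feel for what a "first-order high-pass filter operating at about 700 Hz" does to low frequencies, here is a minimal sketch of the magnitude response of an ideal first-order high-pass. Treating 700 Hz as the -3 dB corner and the middle ear as an ideal filter are simplifying assumptions, not claims from the slides.

```python
import math


def highpass_gain_db(freq_hz, corner_hz=700.0):
    """Gain in dB of an ideal first-order high-pass with the given corner."""
    ratio = freq_hz / corner_hz
    magnitude = ratio / math.sqrt(1.0 + ratio * ratio)
    return 20.0 * math.log10(magnitude)


for f in (7.0, 70.0, 700.0, 7000.0):
    print(f"{f:6.0f} Hz: {highpass_gain_db(f):6.1f} dB")
```

Well below the corner the response falls by about 20 dB per decade (about 6 dB per octave), so under this idealization a 70 Hz tone is attenuated by roughly 20 dB relative to the passband.
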

Johnston 2021
https://www.aes-media.org/sections/pnw/ppt/jj/auditory_mechanisms_01_28_21.pptx
Auditory mechanisms for spatial hearing 2021
Ear detectors fire at start of waveform itself <500 Hz and waveform envelope >~2 kHz with firing rate proportional to loudness but mixed mechanisms at mid-frequencies
~3500 (?) inner hair cell clusters, bandwidth about 1 ERB but not symmetric about center frequency (more sensitive below than above), ERB ~40-50 Hz at low frequencies and about 1/4 octave at frequencies where 1/4 octave > 40-50 Hz. Cut-off at lower frequencies about 25 dB/ERB once past 1/2 ERB below center frequency, varies with SPL of sound input. Cutoff towards higher frequencies once past 1/2 ERB above center frequency incredibly steep, due to basic filtering mechanism.
Loudness compresses within ERB with ~1 ms delay (emphasizes leading edge) at onset but not outside bandwidth, so adds across different ERBs, releases slower with frequency-variant time constants difficult to determine
With two ears, ITD is comparison of time relationships between waveform < 500 Hz, mix of waveform and envelope 500 Hz to 2 kHz, envelope leading edge >2 kHz. Leading edges in ERB emphasized relative to steady state due to onset of compression (part of Haas effect). ILD communicated to brain.
Direct sound includes nonlinear effects of air and some diffusion but arrives first. Auditory compression after 1 ms helps emphasize direct sound
Tones have flat envelopes, so no “first arrival” information. Many reflections off small surfaces at high frequencies [>2 kHz], which interfere with each other and ITD, so hard to localize.
Early reflected sound (longer path length up to ?4-5 m [up to 15 ms?]) reflect timbre of sound source, can affect timbre, localization, and width
Diffused reflections have leading edges somewhat scrambled, will add to loudness in most cases, may create sense of distance, will not usually mess up timbre
Late specular reflections generally bad, can garble articulation, create echoes over ~50 ms delays, effectively create new leading edges
Reverberation tends to have less treble than direct sound
Diffuseness perception influenced by scrambled and different arrival time between ears across frequency
Distance perception: signal to reverb ratio, decorrelation of leading edge across frequency (more=distance)
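
The ERB rule of thumb in the notes above (about 40-50 Hz at low frequencies, about 1/4 octave once that is wider) can be written as a one-line function. This is a sketch under stated assumptions: a 45 Hz floor picked from the middle of the 40-50 Hz range, and "1/4 octave" taken as the band from f·2^(-1/8) to f·2^(1/8). For comparison it also includes the widely used Glasberg and Moore (1990) formula, which is not from these slides.

```python
def erb_rule_of_thumb(center_hz, floor_hz=45.0):
    """ERB in Hz: an assumed 45 Hz floor, or a quarter octave if that is wider."""
    quarter_octave = center_hz * (2.0 ** 0.125 - 2.0 ** -0.125)
    return max(floor_hz, quarter_octave)


def erb_glasberg_moore(center_hz):
    """ERB in Hz from Glasberg and Moore (1990), shown only for comparison."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)


for f in (100.0, 250.0, 1000.0, 4000.0):
    print(f"{f:6.0f} Hz: rule of thumb {erb_rule_of_thumb(f):6.1f} Hz, "
          f"Glasberg-Moore {erb_glasberg_moore(f):6.1f} Hz")
```

With these assumptions the floor applies below roughly 260 Hz. The two estimates differ noticeably at higher frequencies, so the numbers are best read as orders of magnitude rather than precise filter widths.
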

Additional comments by @j_j
https://www.audiosciencereview.com/forum/index.php?threads/our-perception-of-audio.501/page-17
“The latency of the inner hair cell (only fires at most once per millisecond) also contributes” to Haas effect
0.8 ms is “the time delay around the human head, or maybe just a bit more, depending on your hat size. Seriously! .8 or .9 milliseconds is the limit, give or take, once you get over a couple of milliseconds the time delay doesn't make sense to the hearing apparatus at all. Cvetkovic et al. has shown that some exaggeration of the interaural time delay may enhance imaging perception, but also that too much becomes hard to process, and then a bit more and it all falls apart.”
“It's "leading edge" of the signal after being filtered by the respective cochlear filter. It's that simple. The inner hair cell has no trouble firing synchronously at 500Hz or below, so that simply sends time delays on into the brain based on whatever happens to be the "leading edge". Now, leading edge is a poorly defined term here, it is NOT the leading edge of a sine wave into the ear, rather the phase is heavily modified during the cochlear filtering process.”
“If you look at masking reports for critical bandwidths, they vary in dB terms across frequency. On the other hand, using ERB's, the masking effects become very similar at all frequencies, except for some interesting behaviors at very low frequencies related in some difficult to digest fashion to leading edge definition.”
“The 700 Hz highpass filter is a simple result of the tension and stiffness of the ear drum. What it means is that a cloud floating overhead starting a thunderstorm does not, via pressure drop, cause an unbearable noise level. Consider, dropping by .1 PSI corresponds to a LOUD noise. 14.7 PSI is equal to 194 dB SPL in pressure at the eardrum, and a drop of .1 PSI therefore is about 150dB SPL, of course happening at milliHz or MicroHz. That kind of level at the cochlea is very dangerous, but of course we can neither hear it nor does it bother us at such low frequencies. If there was no HP filter wind would be unbearably loud.”
“Remember SPL of ZERO dB is 2x 10^-10 atmospheres, give or take. Ears are very sensitive, and yes, we all beat the crap out of ours in the modern world.”
“As to the 500-700Hz or other such numbers, I do believe it's related both to neural latency and time delay around the head. Remember the maximum firing rate of a neuron is about 1kHz but being biological, it's rather more complicated than just 1kHz. That's without considering the outer hair cell depolarization and such that change the system sensitivity abruptly, and so on.”
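
The pressure figures in the quotes above (14.7 PSI as about 194 dB SPL, a 0.1 PSI drop as about 150 dB SPL, and 0 dB SPL as roughly 2 x 10^-10 atmospheres) follow from the standard 20 micropascal reference. A minimal sketch of that arithmetic, using standard unit conversions and, like the quote, treating a pressure change directly as a sound pressure:

```python
import math

P_REF = 20e-6          # Pa, standard reference pressure for dB SPL
PA_PER_PSI = 6894.757  # Pa per pound per square inch
PA_PER_ATM = 101325.0  # Pa per standard atmosphere


def spl_db(pressure_pa):
    """Sound pressure level in dB re 20 uPa."""
    return 20.0 * math.log10(pressure_pa / P_REF)


print(f"14.7 PSI: {spl_db(14.7 * PA_PER_PSI):.1f} dB SPL")
print(f"0.1 PSI:  {spl_db(0.1 * PA_PER_PSI):.1f} dB SPL")
print(f"0 dB SPL: {P_REF / PA_PER_ATM:.2e} atmospheres")
```
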

Kinnunen 2022
https://www.theseus.fi/bitstream/handle/10024/749081/Kinnunen_Teemu.pdf
“The human ear is most sensitive between 2 000 – 5 000 Hz, largely due to resonances in the ear canal.”
“Hearing is most linear at the 80 dB loudness.”
“As seen in Figure 5, the resonance of the ear canal is about 10 dB at 2 500 Hz. The exact frequency is determined by, among other things, the length of the ear canal, and it is the most characteristic resonant of the HRTF (Head-Related Transfer Function). Another large resonance is generated in the concha, the horn-like structure of which collects sound and directs it towards the ear canal. This resonance is of the order of about 9 dB and settles at 5 kHz. Another influencing factor is the slight accentuation due to the reflections of the pinna at 3 kHz. At higher frequencies the reflections begin to form a slight comb filter effect. The head, body and other parts of the body also contribute to resonances. All these together form an ear resonance of about 17 dB at the center frequency of 2700 Hz, where real-ear audiological measurements generally place the ear resonance (Staab 2014).”
“Attenuation in frequency response called “Pinna Notch” can be seen usually around 10 kHz but depending on the individual it can vary. At these high frequencies, the reflected signal from pinna is out of phase with the direct signal, and destructive interference occurs. It also has a greater effect on the sound coming from above than the sound coming directly from the front, and this can help to localize vertical sound. (University of California Davis 2011.)”
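
The pinna-notch description above (a reflection arriving out of phase with the direct sound) can be sketched as a two-path comb filter. This is a toy model under assumed conditions: a single non-inverting reflection, a sound speed of 343 m/s, and hypothetical function names. The first notch then falls where the extra path length equals half a wavelength.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def first_notch_hz(extra_path_m, c=SPEED_OF_SOUND):
    """Lowest frequency at which the reflection arrives half a cycle late."""
    return c / (2.0 * extra_path_m)


def extra_path_for_notch(notch_hz, c=SPEED_OF_SOUND):
    """Extra path length (m) that would put the first notch at notch_hz."""
    return c / (2.0 * notch_hz)


def comb_magnitude(freq_hz, extra_path_m, reflection_gain=1.0, c=SPEED_OF_SOUND):
    """Magnitude of direct sound plus one delayed reflection."""
    delay_s = extra_path_m / c
    return abs(1.0 + reflection_gain * cmath.exp(-2j * math.pi * freq_hz * delay_s))


path = extra_path_for_notch(10_000.0)
print(f"Extra path for a 10 kHz notch: {path * 100:.2f} cm")
print(f"Magnitude at 5 kHz:  {comb_magnitude(5_000.0, path):.2f}")
print(f"Magnitude at 10 kHz: {comb_magnitude(10_000.0, path, reflection_gain=0.5):.2f}")
```

In this model a 10 kHz notch corresponds to an extra path of under 2 cm, which is at least the right scale for pinna dimensions. The notch moves as the path difference changes with direction, which fits the quote's point about vertical localization.
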
