
Our perception of audio

andreasmaaan

Master Contributor
Forum Donor
Joined
Jun 19, 2018
Messages
6,652
Likes
9,403
There's not a short answer. To some extent, constant delay (let's call it that, linear phase is kind of misleading) is important, but it doesn't have to be exactly constant delay. Phase inversion in some frequencies matters, in others it does not as long as it doesn't change the signal envelope.

There is a lot to be said, but it requires a blackboard and time.

A look at the binaural masking level difference might be a starter even though we are talking here about monaural issues.

Thanks. Have you written on this in more detail elsewhere, perhaps?
 

Thomas Lund

Member
Technical Expert
Industry Insider
Joined
May 15, 2018
Messages
75
Likes
342
Location
Aarhus, Denmark
“Natural” because that’s how our senses tend to serve us, separating static from dynamic stimuli, based on a human time-scale: air pressure, tone of light, movement (peripheral vision), etc. We’re predisposed to concentrate on agents rather than the setting, and loudspeakers functionally belong to the former group.

Considering their in-room frequency response, there’s a limit to what we are able to resolve as direct sound, depending e.g. on time, level, azimuth and frequency. In monitoring, it’s a must to compensate for boundary loading effects so you don’t risk misjudging LF level wildly, often by 6-18 dB.
 

youngho

Senior Member
Joined
Apr 21, 2019
Messages
486
Likes
800
Not all in one place, but a trip through the tutorials at www.aes.org/sections/pnw (look at the powerpoints and past meetings tabs) might help some.
@j_j Thank you for suggesting this site. I found a number of great presentations that you did. Very educational, though even some of the low-level stuff was over my head, so to speak.
 

youngho

Senior Member
Joined
Apr 21, 2019
Messages
486
Likes
800
There's not a short answer. To some extent, constant delay (let's call it that, linear phase is kind of misleading) is important, but it doesn't have to be exactly constant delay. Phase inversion in some frequencies matters, in others it does not as long as it doesn't change the signal envelope.

There is a lot to be said, but it requires a blackboard and time.

A look at the binaural masking level difference might be a starter even though we are talking here about monaural issues.
@j_j , On the simplest level, is constant delay relatively less important at low frequencies <500 Hz and more important at high frequencies, like >2 kHz (although I saw that you wrote 4 kHz in other places) with a transition between those two points? How does the middle ear high pass filter at 700 Hz relate to this, since you wrote that phase changes rapidly where the amplitude is greatest?

Thanks so much,

Young-Ho
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,786
Location
My kitchen or my listening room.
@j_j , On the simplest level, is constant delay relatively less important at low frequencies <500 Hz and more important at high frequencies, like >2 kHz (although I saw that you wrote 4 kHz in other places) with a transition between those two points? How does the middle ear high pass filter at 700 Hz relate to this, since you wrote that phase changes rapidly where the amplitude is greatest?

Thanks so much,

Young-Ho

MORE sensitive to waveform at low frequencies.

Sensitive mostly to envelope (inside of an ERB) shape and delay over 4 kHz.

In the middle, well, it's in the middle. The turnover point is in the neighborhood of 1000 Hz, which is also the maximum firing rate of an inner hair cell, and yeah, that kind of sets the bar.
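
A minimal numpy/scipy sketch of the envelope-versus-waveform distinction (an illustrative example, not j_j's code; the 4 kHz carrier and 200 Hz modulation are arbitrary choices): inverting the carrier phase leaves the Hilbert envelope, the thing the ear tracks up there, unchanged.

```python
# Envelope vs. waveform: above ~4 kHz the ear mostly tracks the envelope
# within an ERB, so inverting the carrier phase changes nothing it can use.
import numpy as np
from scipy.signal import hilbert

fs = 48000                                        # sample rate, Hz (assumed)
t = np.arange(0, 0.05, 1 / fs)                    # 50 ms of signal
envelope = 1 + 0.8 * np.sin(2 * np.pi * 200 * t)  # 200 Hz modulation (assumed)

carrier = np.sin(2 * np.pi * 4000 * t)            # 4 kHz carrier (assumed)
x = envelope * carrier
x_inv = envelope * -carrier                       # carrier phase inverted

env_a = np.abs(hilbert(x))
env_b = np.abs(hilbert(x_inv))
print(np.max(np.abs(env_a - env_b)))              # ~0: envelopes identical
```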
 

youngho

Senior Member
Joined
Apr 21, 2019
Messages
486
Likes
800
MORE sensitive to waveform at low frequencies.

Sensitive mostly to envelope (inside of an ERB) shape and delay over 4 kHz.

In the middle, well, it's in the middle. The turnover point is in the neighborhood of 1000 Hz, which is also the maximum firing rate of an inner hair cell, and yeah, that kind of sets the bar.
Oops, sorry, I realize that I got that completely backwards. I'm reading through and trying to get a basic handle on your presentations like Hearing 096 and Auditory Mechanisms for Spatial Hearing now, and I should have read what you wrote about phase lock at low frequencies.

Is it possible to very briefly explain where constant delay may be less important and where phase shift might not matter? Like around the middle ear high pass? Or perhaps again closer to 1.5-2 kHz due to shift from ITD to ILD, also head shadowing? Or perhaps 1-3 kHz where localization seems to be poor, according to this review:
https://www.sciencedirect.com/science/article/pii/S187972961830067X (although I realize that perhaps the issue here was the use of tones, so no "leading edge" to guide localization)? Or simply 500-2 kHz where the two detector mechanisms conflict?

Thank you again, @j_j

Young-Ho
 
Last edited:

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,786
Location
My kitchen or my listening room.
Between 500 Hz and 2 kHz things work very differently, because the two mechanisms conflict, sometimes quite destructively.

If you use tone bursts, you can separate out the issue of tone lock from envelope lock, but below 500 Hz nerve firing synchs with phase quite solidly. At 1000 Hz it starts to break down because of the inability of the inner hair cell to fire that fast. While some results have shown some level of synchrony to 4 or 5 kHz, the sensitivity is mostly gone.

Leading edge is absolutely the thing above 2 kHz, though. It could even be claimed to be "the thing" anywhere, frankly, with a temporally narrow (hence wideband) pulse. In such cases, the loss of sensitivity is not as big in the 1-2 kHz range, either, but it's (redacted) hard to characterize.
 

Bob from Florida

Major Contributor
Joined
Aug 20, 2020
Messages
1,286
Likes
1,180
Frantz posted this on the WBF forum and I thought it also belongs here. It is the most fundamental concept in audio evaluation and we need to get on the same page about it. The text is my response to his post.

---


While this is responsible for some of the faulty observations, just as big of a factor is the elasticity of our perception of audio. When evaluating new additions to our systems, we become far more attentive. We are dying to know if the new addition made a difference. We pay far more attention to what is played and as a result, hear detail, nuances, etc. that we did not when we were just enjoying music. What happens then is what you say: we attach those improvements to the new device and bias makes sure that when we go back to "before" configuration, we don't hear those improvements.

This is a very difficult thought exercise but when faced with this situation, try to see if you can hear the same differences in the old configuration. Having done that, all of those improvements appear in the older configuration too! And by the same token, you can take them out of the new config/tweak.

This is why blind tests work better. There, you apply the analytic technique to both samples, not just one. And without knowing the identity of each sample, whatever you think is different cannot be associated with the new config/tweak.

The above explains why our evaluation of audio products can be faulty even when "we didn't expect to hear a difference." Or, "I expected it to sound worse." In both cases, we still listen more attentively and as a result hear more detail whether we expected it or not. This happens to me all the time even though I am hyper aware of listener bias. The above thought exercises and blind testing are the only way I can pull myself out of false conclusions.

I can't tell you how many times I have run a blind test and read more detail, resolution, or a lower noise floor into one sample, only to have all of those observations be false. And then been able to hear all of that in the other sample and not the first!

Without the checks and balances that blind testing provides, i.e. holding the truth card, we can lead ourselves to completely wrong conclusions about the products we are evaluating and our ability to do so. And once lost in the forest, anything goes from there on.

BTW, I hope you don't mind me stealing your post for ASR Forum :). I won't be posting more in this thread but would like to continue the discussion there.

Never having visited the WBF, I Googled it. The first hit made me chuckle and is below.

[attached image: IMG_1563.jpeg]
 

youngho

Senior Member
Joined
Apr 21, 2019
Messages
486
Likes
800
Between 500 Hz and 2 kHz things work very differently, because the two mechanisms conflict, sometimes quite destructively.

If you use tone bursts, you can separate out the issue of tone lock from envelope lock, but below 500 Hz nerve firing synchs with phase quite solidly. At 1000 Hz it starts to break down because of the inability of the inner hair cell to fire that fast. While some results have shown some level of synchrony to 4 or 5 kHz, the sensitivity is mostly gone.

Leading edge is absolutely the thing above 2 kHz, though. It could even be claimed to be "the thing" anywhere, frankly, with a temporally narrow (hence wideband) pulse. In such cases, the loss of sensitivity is not as big in the 1-2 kHz range, either, but it's (redacted) hard to characterize.
@j_j , in terms of >2 kHz, could you possibly explain a little further what you wrote in https://www.icsi.berkeley.edu/icsi/sites/default/files/events/events_1304_johnston.pdf : "a perceptually indirect signal (a term applying only above 2kHz or so) will have a flat envelope, and thereby provide no information to correlate. In other words, the flat envelopes will in fact ‘correlate’ but the auditory system has no features to lock onto to determine either direction or diffusion. Such signals are often ignored in the most complex stages of hearing, but will have an unnatural sensation to them when observed"? If I understand correctly, Bech suggested that spectral energy >2 kHz in reflections was responsible for spatial contributions, so is the idea that the reflections arrive while the relevant ERBs are still compressed, so there would be no envelope leading edge onset for frequencies >2 kHz to "lock onto," resulting in an inability to identify any directionality for this part of the spectral content?

Young-Ho
 

Tim Link

Addicted to Fun and Learning
Forum Donor
Joined
Apr 10, 2020
Messages
745
Likes
648
Location
Eugene, OR
In Chapter 17 I discuss hearing loss, and in section 17.3 address a disturbing recent discovery: hidden hearing loss. In addition to the well known elevation of hearing thresholds as a function of "wear and tear" in the cochlear mechanisms, humans with - and without! - such losses can exhibit a disability in binaural hearing. We are less able to distinguish multiple sources in space, and one has to assume, aspects of space itself. This being so, the suspicion is that our ability to adapt to listening spaces, and to separate sources from venues, deteriorates as well (anybody have problems carrying on discussions in restaurants?). Hearing loss itself is commonly associated with degraded peripheral apparatus, but this new factor seems to involve more central auditory processing. Future developments will be interesting to follow.
I started to suspect this kind of binaural/spatial loss back in my early 40s when I first experimented with using a divider wall between stereo speakers. What I heard with the crosstalk reduced reminded me of the kinds of effects I got from just about any stereo system when I was a kid. The stereophonic effect was pretty much always powerful and mesmerizing, mind-blowing. With age it seemed to have dulled, and the divider wall reducing the crosstalk seemed to bring a bunch of it back.
Another disturbing discovery for me was just how prevalent loss of balance sense from the inner ear becomes as we age. They can pretty much tell how old you are by how long you can balance on one foot with your eyes closed. I tried it and was very deflated by my performance.
It's interesting just how much value we put on sonic effects as audiophiles. As a young person, I was horrified at the idea of hearing loss. I realized I'd rather lose hearing than sight, but that was a practical consideration and not a passionate feeling. Now that I'm older and have experienced the damage and degradation, I'm happy to know that I'm still enjoying music, resigned to my continuing losses and no longer mourning. I'm at peace with tinnitus.

Maybe future generations will enjoy the benefits of treatments or methods that preserve excellent sensory capabilities much further into old age.

I wonder if people who have more spatial hearing capacity intact are more able to tease apart a standard two-speaker stereo setup's crosstalk anomalies than those of us with lost capacity. That might explain why some people claim to hear nothing of interest from crosstalk reduction in terms of soundstage width, depth, and clarity.
 
Last edited:

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,786
Location
My kitchen or my listening room.
@j_j , in terms of >2 kHz, could you possibly explain a little further what you wrote in https://www.icsi.berkeley.edu/icsi/sites/default/files/events/events_1304_johnston.pdf : "a perceptually indirect signal (a term applying only above 2kHz or so) will have a flat envelope, and thereby provide no information to correlate. In other words, the flat envelopes will in fact ‘correlate’ but the auditory system has no features to lock onto to determine either direction or diffusion. Such signals are often ignored in the most complex stages of hearing, but will have an unnatural sensation to them when observed"? If I understand correctly, Bech suggested that spectral energy >2 kHz in reflections was responsible for spatial contributions, so is the idea that the reflections arrive while the relevant ERBs are still compressed, so there would be no envelope leading edge onset for frequencies >2 kHz to "lock onto," resulting in an inability to identify any directionality for this part of the spectral content?

Young-Ho

If the envelope is very smooth, you're reduced to amplitude differences between the ears above 2 kHz (a tiny bit works to 4 kHz, but not much). In almost any space, the reverberation will be very similar at both ears. If the envelope varies (either randomly or in a delayed fashion) at the two ears, you get a very different sensation. If it has leading edges and they align, that will image to front or back (depending on the timbre); if they are within ±0.8 or so milliseconds, they will pull to the earlier side; and if they are wildly different, it will give you a sense of "distant".

There are other variations, but what this comes down to, for instance, is that a pulsed 10 kHz tone is easily localized. A continuous 10 kHz tone is (*&( impossible to localize in most acoustic situations.
 

Tim Link

Addicted to Fun and Learning
Forum Donor
Joined
Apr 10, 2020
Messages
745
Likes
648
Location
Eugene, OR
I did a little test using headphones and a 10 kHz tone. No sense of direction with a 0.5 ms delay. If I create leading edges by processing the tone with a 100 Hz square-wave tremolo, the delay causes the tone to seem to come from the direction of the earlier signal.

With a 100 Hz sine wave tone, the direction can be detected without the need for leading edges.
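
A sketch of how such test stimuli can be generated (a reconstruction under stated assumptions: 48 kHz sample rate, the 0.5 ms delay and 100 Hz tremolo described above; file names and levels are arbitrary):

```python
# Reconstruction of the headphone test described above (assumptions:
# 48 kHz sample rate, 0.5 ms interaural delay, 100 Hz square-wave tremolo).
import numpy as np
from scipy.io import wavfile
from scipy.signal import square

fs = 48000
t = np.arange(0, 2.0, 1 / fs)                 # 2 s per stimulus
itd = int(round(0.5e-3 * fs))                 # 0.5 ms -> 24 samples

def stereo(sig, delay):
    # Delay the right channel; np.roll wraps around, which is harmless
    # for steady periodic tones like these.
    return np.stack([sig, np.roll(sig, delay)], axis=1)

tone10k = np.sin(2 * np.pi * 10000 * t)
gate = 0.5 * (1 + square(2 * np.pi * 100 * t))     # 100 Hz on/off tremolo
sine100 = np.sin(2 * np.pi * 100 * t)

for name, sig in [("10k_steady", tone10k),         # no clear direction
                  ("10k_tremolo", tone10k * gate), # pulls to the earlier ear
                  ("100_sine", sine100)]:          # localizable via phase
    data = (0.3 * stereo(sig, itd) * 32767).astype(np.int16)
    wavfile.write(f"{name}.wav", fs, data)
```

Over headphones, the steady file should give no clear direction while the tremolo file pulls toward the earlier channel, matching the description above.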
 

Tim Link

Addicted to Fun and Learning
Forum Donor
Joined
Apr 10, 2020
Messages
745
Likes
648
Location
Eugene, OR
I think it could be related to the "audio illusion" where you hear a garbled and unintelligible message, then hear it ungarbled, and then can "clearly" understand the original garbled version. The effect wears off; come back a week later, listen only to the garbled version, and it is unintelligible again.

https://soundcloud.com/whyy-the-pulse/an-audio-illusion


I could be wrong, as this is an extreme case (and speech-processing related) compared to usual audio differences, but still...
That's pretty amazing! Reminds me of those people who hear messages in records played backwards, or static radio sounds recorded at cemeteries, supposedly voices of the dead. They tell you what you're supposed to hear, and by golly I hear it!
 

youngho

Senior Member
Joined
Apr 21, 2019
Messages
486
Likes
800
If the envelope is very smooth, you're reduced to amplitude differences between the ears above 2 kHz (a tiny bit works to 4 kHz, but not much). In almost any space, the reverberation will be very similar at both ears. If the envelope varies (either randomly or in a delayed fashion) at the two ears, you get a very different sensation. If it has leading edges and they align, that will image to front or back (depending on the timbre); if they are within ±0.8 or so milliseconds, they will pull to the earlier side; and if they are wildly different, it will give you a sense of "distant".

There are other variations, but what this comes down to, for instance, is that a pulsed 10 kHz tone is easily localized. A continuous 10 kHz tone is (*&( impossible to localize in most acoustic situations.
Thank you, @j_j . I think I almost have the beginning of a handle on some of the concepts here, and I really appreciate your taking the time to respond to a complete layman like me. I have to tell you that the ERB compression description was my first encounter (unless I missed one previously) with an explanation of a physical basis for the Haas effect.

From what I vaguely comprehend on the most basic level, the concept of envelope ITD processing modifies the traditional duplex Rayleigh model (if I understand it correctly) of ITD below ~1500 Hz and ILD above ~1500 Hz. This correlates with Griesinger's remark that "Human perception is particularly sensitive to sounds with sharp onsets." At higher frequencies (I assume the ~2 kHz is related at least in part to the width of the human head, along with physiological mechanisms like the limits of inner hair cell firing you describe, although I understand that it's not a hard limit, hence 2-4 kHz), the CNS interprets some combination of ILD and envelope onset/alignment when available (like from perceived direct sound) for lateralization, along with timbre for front/back depending on pinna/HRTF. I assume that ±0.8 ms is related to the Haas/precedence effect. Otherwise, the sound is "diffuse" or "perceptually indirect" if there is no ILD or envelope onset/alignment. Ron Sauro was dismissive about mathematical diffusers being measurably effective only at higher frequencies (https://www.stereophile.com/content/nwaa-labs-measurement-beyond-atomic-level-page-2), but it almost seems to me like they only need to be effective at randomizing phase, and hence randomly varying envelope onset, at such frequencies anyway.
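
As a side note on the diffuser point, the operating range of a quadratic-residue diffuser follows from its geometry; a toy calculation using the standard QRD well-depth formula (the N=7 wells and 1 kHz design frequency here are arbitrary assumptions for illustration):

```python
# Toy quadratic-residue diffuser (QRD) calculation; the standard formula is
# d_n = (n^2 mod N) * lambda_design / (2N). N = 7 and the 1 kHz design
# frequency are assumptions, not values from the thread.
N = 7
f_design = 1000.0        # lowest frequency the diffuser works well at, Hz
c = 343.0                # speed of sound, m/s

lam = c / f_design
depths_cm = [((n * n) % N) * lam / (2 * N) * 100 for n in range(N)]
print(depths_cm)         # well depths in cm: 0, ~2.5, ~9.8, ~4.9, ...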

If, indeed, the wavelength corresponding to the width of the human head coincides with envelope leading edge detection taking over completely (again, if I understand correctly), what might be a physiological basis for the 500 Hz figure for waveform detection? I'm still trying to understand the implication of the 700 Hz middle ear high pass filter, also how you describe "critical bandwidths are about 100Hz up to 700Hz, and 1/3 octave thereafter" (https://www.icsi.berkeley.edu/icsi/sites/default/files/events/events_1304_johnston.pdf). A lot seems to happen around this 500-700 Hz range! Griesinger writes “Frequencies below 500Hz are primarily responsible for perceptions of Resonance, Envelopment, Warmth," whereas @Thomas Lund writes "With both stereo and immersive, for your room and system to be able to reliably convey the envelopment latent in the content, perceived-direct sound should dominate in the 50 to 700 Hz range – where audible patterns characteristic of the recording space may have been picked up" and "From 50 Hz to 700 Hz, however, fast-firing synapses in the brainstem are responsible for localisation, employed in a phase-locking structure to determine interaural time difference (ITD)" (https://www.genelec.com/-/blog/how-to-analyse-frequency-and-temporal-responses).

If the auditory mechanism is more sensitive to phase at lower frequencies, how does the modal region affect that perception, not just at much lower frequencies (you wrote about interaural phase differences at 40-90 Hz creating a sensation of space, similar to what Griesinger and @Thomas Lund discuss), but rather between, say, 90-300 Hz, if at all?

Thanks again,

Young-Ho
 
Last edited:

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,786
Location
My kitchen or my listening room.
Thank you, @j_j . I think I almost have the beginning of a handle on some of the concepts here, and I really appreciate your taking the time to respond to a complete layman like me. I have to tell you that the ERB compression description was my first encounter (unless I missed one previously) with an explanation of a physical basis for the Haas effect.
The latency of the inner hair cell (only fires at most once per millisecond) also contributes.

I assume that ±0.8 ms is related to the Haas/precedence effect. Otherwise, the sound is "diffuse" or "perceptually indirect" if there is no ILD or envelope onset/alignment.

It's the time delay around the human head, or maybe just a bit more, depending on your hat size. Seriously! 0.8 or 0.9 milliseconds is the limit, give or take; once you get over a couple of milliseconds the time delay doesn't make sense to the hearing apparatus at all. Cvetkovic et al. have shown that some exaggeration of the interaural time delay may enhance imaging perception, but also that too much becomes hard to process, and then a bit more and it all falls apart.
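
The round-the-head figure is easy to sanity-check with Woodworth's spherical-head model (a sketch assuming an 8.75 cm head radius; the bare sphere comes in a bit under the 0.8-0.9 ms quoted above, which allows for "hat size"):

```python
# Sanity check on the ~0.8-0.9 ms figure: Woodworth's spherical-head model,
# ITD = (r/c) * (theta + sin(theta)), maximal for a source at 90 degrees.
# The 8.75 cm head radius is an assumed textbook value, not from the thread.
import math

r = 0.0875                                  # head radius, m (assumed)
c = 343.0                                   # speed of sound, m/s
theta = math.pi / 2                         # source directly to one side

itd_max = (r / c) * (theta + math.sin(theta))
print(f"max ITD ~ {itd_max * 1e3:.2f} ms")  # ~0.66 ms for the bare sphere
```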

what might be a physiological basis for the 500 Hz figure for waveform detection?

It's "leading edge" of the signal after being filtered by the respective cochlear filter. It's that simple. The inner hair cell has no trouble firing synchronously at 500Hz or below, so that simply sends time delays on into the brain based on whatever happens to be the "leading edge". Now, leading edge is a poorly defined term here, it is NOT the leading edge of a sine wave into the ear, rather the phase is heavily modified during the cochlear filtering process.


I'm still trying to understand the implication of the 700 Hz middle ear high pass filter, also how you describe "critical bandwidths are about 100Hz up to 700Hz, and 1/3 octave thereafter" (https://www.icsi.berkeley.edu/icsi/sites/default/files/events/events_1304_johnston.pdf).

That's an older deck; I would advise using "ERBs" now, so about 40-50 Hz wide until they are 1/4 octave wide, at which point proceed by 1/4 octaves. The bark scale (critical bands) has perhaps been supplanted by newer work, but the basic idea holds. (Scharf's work was very good for its time, but it was a LOT of data, and there were no computers to crunch all of it when Critical Bands (or Barks) were developed, and they are close enough to offer some useful description.) If you look at masking reports for critical bandwidths, they vary in dB terms across frequency. On the other hand, using ERBs, the masking effects become very similar at all frequencies, except for some interesting behaviors at very low frequencies related in some difficult-to-digest fashion to leading edge definition. Sorry, don't have a clear answer there.
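
For reference, the widely used Glasberg & Moore approximation of the ERB scale (added here for convenience; it gives roughly 40-50 Hz widths at low frequencies and settles near 11-13% of centre frequency higher up, in the same spirit as the fractional-octave rule above):

```python
# Glasberg & Moore (1990) approximation: ERB(f) = 24.7 * (4.37*f/1000 + 1).
for f in [100, 250, 500, 1000, 2000, 4000, 8000]:
    erb = 24.7 * (4.37 * f / 1000 + 1)
    print(f"{f:5d} Hz: ERB ~ {erb:6.1f} Hz ({100 * erb / f:4.1f}% of fc)")
```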

The 700 Hz highpass filter is a simple result of the tension and stiffness of the ear drum. What it means is that a cloud floating overhead starting a thunderstorm does not, via pressure drop, cause an unbearable noise level. Consider: a drop of 0.1 PSI corresponds to a LOUD noise. 14.7 PSI is equal to 194 dB SPL in pressure at the eardrum, and a drop of 0.1 PSI therefore is about 150 dB SPL, of course happening at millihertz or microhertz. That kind of level at the cochlea is very dangerous, but of course we can neither hear it nor does it bother us at such low frequencies. If there were no HP filter, wind would be unbearably loud.
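
The arithmetic is easy to verify against the standard 20 µPa reference (a quick check using only textbook constants):

```python
# Checking the pressure arithmetic: dB SPL = 20 * log10(p / 20 uPa).
import math

P_REF = 20e-6            # reference pressure, Pa
PSI_TO_PA = 6894.76

def db_spl(pascals):
    return 20 * math.log10(pascals / P_REF)

print(f"14.7 psi (1 atm): {db_spl(14.7 * PSI_TO_PA):.0f} dB SPL")  # ~194
print(f"0.1 psi drop    : {db_spl(0.1 * PSI_TO_PA):.0f} dB SPL")   # ~151
```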

Remember, an SPL of ZERO dB is 2×10^-10 atmospheres, give or take. Ears are very sensitive, and yes, we all beat the crap out of ours in the modern world.

As to the 500-700 Hz or other such numbers, I do believe it's related both to neural latency and time delay around the head. Remember the maximum firing rate of a neuron is about 1 kHz, but, being biological, it's rather more complicated than just 1 kHz. That's without considering the outer hair cell depolarization and such that change the system sensitivity abruptly, and so on.
 
Last edited:

youngho

Senior Member
Joined
Apr 21, 2019
Messages
486
Likes
800
The latency of the inner hair cell (only fires at most once per millisecond) also contributes.
@j_j Thanks! Probably both phenomena are related. I tried finding descriptions of the precedence effect in other mammals, and with a quick search only came up with one paper mentioning it in manatees, though only theoretically.
It's the time delay around the human head, or maybe just a bit more, depending on your hat size. Seriously! 0.8 or 0.9 milliseconds is the limit, give or take; once you get over a couple of milliseconds the time delay doesn't make sense to the hearing apparatus at all.
Yes, it makes total sense that head size (wavelength) is directly related to shadowing (frequency, hence time). I was thinking about this as the human head being 140-145 mm in width, so ~5.7 in or ~0.48 ft, giving ~2379 Hz (a period of ~0.42 ms), but probably I should have considered the half-wavelength, so if signals arrive in phase (or close enough) higher than that, their envelopes align sufficiently to promote lateral localization, or am I confused again?
It's "leading edge" of the signal after being filtered by the respective cochlear filter. It's that simple. The inner hair cell has no trouble firing synchronously at 500Hz or below, so that simply sends time delays on into the brain.
I guess I was wondering if there was some extra-auricular explanatory physiological correlate for 500 Hz, since you mentioned that the latency of the inner hair cell limits it to at most 1000 Hz or so, while ~2 kHz seemed to be related to the width of the human head. Average shoulder width is apparently 16 in or 41 cm, corresponding to about 840 Hz, so the half-wavelength frequency would be a little lower than the 500 Hz range--silly as it sounds, I was trying to think of correlates to this frequency range, maybe sound reflecting/diffracting off the human torso.
That's an older deck; I would advise using "ERBs" now, so about 40-50 Hz wide until they are 1/4 octave wide, at which point proceed by 1/4 octaves. The bark scale (critical bands) has perhaps been supplanted by newer work, but the basic idea holds. (Scharf's work was very good for its time, but it was a LOT of data, and there were no computers to crunch all of it when Critical Bands (or Barks) were developed, and they are close enough to offer some useful description.)
Thank you, I'll update my notes.
The 700 Hz highpass filter is a simple result of the tension and stiffness of the ear drum. What it means is that a cloud floating overhead starting a thunderstorm does not, via pressure drop, cause an unbearable noise level. Consider: a drop of 0.1 PSI corresponds to a LOUD noise. 14.7 PSI is equal to 194 dB SPL in pressure at the eardrum, and a drop of 0.1 PSI therefore is about 150 dB SPL, of course happening at millihertz or microhertz. That kind of level at the cochlea is very dangerous, but of course we can neither hear it nor does it bother us at such low frequencies. If there were no HP filter, wind would be unbearably loud.

As to the 500-700Hz or other such numbers, I do believe it's related both to neural latency and time delay around the head. Remember the maximum firing rate of a neuron is about 1kHz but being biological, it's rather more complicated than just 1kHz. That's without considering the outer hair cell depolarization and such that change the system sensitivity abruptly, and so on.
Yes, simple minds like mine automatically try to simplify things. My question regarding the 700 Hz high pass filter had to do with constant delay sensitivity around this frequency.

My next key question would be about the effects of the modal region (in smaller rooms) on phase and spatial perception thereof, apart from swirling or relatively rapidly changing interaural phase differences at lower (40-90 Hz) frequencies.

Thanks again,

Young-Ho
 
Last edited: