
Bass and subwoofers

Dr. Griesinger mentions extracting ITD from the positive-going zero crossings. Is that what you mean here, or something else?

Go watch the hearing tutorial at the PNW site as long as it's still up.

At low frequencies, positive-going on the basilar membrane, yes. Above 800Hz that starts to lose importance, and by about 2kHz it's gone for all practical purposes (there is some remaining waveform sensitivity up to 4kHz, but it's overwhelmed for the most part); above that it's the leading edge of the filtered envelope.

For your use in the bass, I'd make an 8th order fit to the cochlear filter as an IIR minimum-phase filter (match both skirts, which is moderately tricky), or as an asymmetric FIR filter (it will be long, but computers are fast) that is a better fit; make sure you convert that filter to pure minimum phase, and then look at the positive-going waveform in each 1/3 ERB.
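For illustration only, here is a rough numpy sketch of the "convert to pure minimum phase" step, using the standard real-cepstrum (homomorphic) trick; the asymmetric magnitude target itself is whatever band shape you end up fitting:

    import numpy as np

    def minimum_phase_fir(mag, n_taps):
        # mag: desired magnitude response sampled on the full FFT grid
        # (conjugate-symmetric, bins 0..N-1, as for a real filter).
        n = len(mag)
        log_mag = np.log(np.maximum(mag, 1e-10))      # avoid log(0)
        cep = np.fft.ifft(log_mag).real               # real cepstrum of the target
        fold = np.zeros(n)                            # fold: keep c[0], double the
        fold[0] = cep[0]                              # positive quefrencies, zero the rest
        fold[1:n // 2] = 2.0 * cep[1:n // 2]
        if n % 2 == 0:
            fold[n // 2] = cep[n // 2]
        h_spec = np.exp(np.fft.fft(fold))             # minimum-phase frequency response
        return np.fft.ifft(h_spec).real[:n_taps]      # truncate to the desired FIR length

The band-filtered signals can then be inspected for their positive-going zero crossings as described above.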
 
In recreational listening, it can be argued that music should primarily sound nice and comforting.

Absolutely. As an aside, it would be nice if people would throw out level-only panpots, but I'm not holding my breath, because low rate codecs that insist on things like "intensity stereo" will utterly destroy a properly done signal. :(
 
Go watch the hearing tutorial at the PNW site
I watched your 2019 "Human Hearing" talk as well as "Auditory Mechanisms for Spatial Hearing". Much of it was new info to me, so I learned a lot.

For your use in the bass, I'd make an 8th order fit to the cochlear filter in an IIR minimum-phase filter (match both skirts, which is moderately tricky)
Would you mind giving me a hint as to which set of data I should be looking at? There seem to be a number of somewhat different measured curves and quite a few cochlear filter models having different shapes and slopes (especially toward higher frequencies).

Off topic, but I was intrigued by your comment about audible pre-echo in some SRCs, so I looked through your SRC slides. I was not aware that there were/are SRCs with an impulse response like that. Fortunately, I didn't make that kind of mistake with the one I designed—I do oversampling/decimation in the frequency domain along with a windowed sinc filter having an absurd amount of stopband attenuation:
filter.png
Seems to do OK with your test signal (48kHz → 88.2kHz):
src_48k_to_88_2k.png
It's not even particularly slow (thanks to FFTW3 :)).
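For anyone who just wants a quick reference point without writing their own resampler: a broadly comparable 48kHz → 88.2kHz conversion (ratio 147/80) can be done with scipy's polyphase resampler and a Kaiser-windowed sinc. This is a different technique from my frequency-domain implementation, just a sketch for generating something to compare against:

    import numpy as np
    from scipy.signal import resample_poly

    fs_in, fs_out = 48000, 88200                     # 88200/48000 = 147/80
    x = np.zeros(fs_in)
    x[fs_in // 2] = 1.0                              # impulse test signal
    # Kaiser-windowed sinc; a large beta pushes the stopband well down
    y = resample_poly(x, 147, 80, window=('kaiser', 14.0))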
 
Well, I'd use 25dB/erb slope on the high frequency side, 15dB slope on the lower frequency side, and flat in the middle except for a 6dB bump in the middle third.

(edited to correct my own mixing of position vs. frequency, duh)
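To put numbers on that in code form (a sketch only; the flat passband width of 1 ERB is an assumption I'm making here, not something derived):

    import numpy as np

    def erb_number(f):
        # Glasberg & Moore ERB-rate scale (ERB number vs. frequency in Hz)
        return 21.4 * np.log10(1.0 + 0.00437 * f)

    def band_target_db(f, fc, passband_erb=1.0):
        # Target in dB: 25 dB/ERB upper skirt, 15 dB/ERB lower skirt,
        # flat passband (assumed 1 ERB wide) with a 6 dB bump in its middle third.
        d = erb_number(np.asarray(f, dtype=float)) - erb_number(fc)
        half = passband_erb / 2.0
        out = np.zeros_like(d)
        hi = d > half
        lo = d < -half
        out[hi] = -25.0 * (d[hi] - half)              # high-frequency side
        out[lo] = -15.0 * (-d[lo] - half)             # low-frequency side
        out[np.abs(d) <= passband_erb / 6.0] += 6.0   # bump in the middle third
        return out

That dB curve, converted to linear magnitude on a full FFT grid, can be fed to a minimum-phase FIR design like the one sketched earlier in the thread.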
 
15dB/erb slope on the high frequency side, 25dB slope on the lower frequency side
Hold on... I was under the impression that the filters tend to be steeper on the high frequency side. Am I missing something?
 
The HF cutoff is very sharp, because the energy has passed through the basilar membrane (much like a transmission filter). I dare say I said it backwards? Did I? Yeah, I answered from the POV of the basilar membrane position. I'll fix it.
 
Well, I'd use 25dB/erb slope on the high frequency side, 15dB slope on the lower frequency side, and flat in the middle except for a 6dB bump in the middle third.
After reading some stuff on auditory filter bank models, I decided to use the "differentiated all-pole gammatone filter" formulation, at least for now. It can be made to fit the suggested shape reasonably well and is simple to implement. The Q can be modulated to simulate cochlear compression (along with the associated center frequency shift) if I ever decide to try adding that.
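In case it helps anyone following along, the digital version is essentially a cascade of identical complex one-pole resonators (the all-pole part) with a zero at DC bolted on for the "differentiated" variant. A very rough, uncalibrated sketch (the ERB constants are the usual Glasberg & Moore ones, and the per-stage normalization is approximate):

    import numpy as np
    from scipy.signal import lfilter

    def dapgf_band(x, fs, fc, order=4):
        # Rough differentiated all-pole gammatone band filter (sketch, not calibrated).
        erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)      # ERB bandwidth in Hz
        bw = 1.019 * erb                             # classic gammatone bandwidth factor
        pole = np.exp(-2 * np.pi * bw / fs) * np.exp(2j * np.pi * fc / fs)
        y = x.astype(complex)
        for _ in range(order):                       # N identical one-pole stages
            y = lfilter([1.0 - abs(pole)], [1.0, -pole], y)
        return np.diff(np.real(y), prepend=0.0)      # the DC zero ("differentiated")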

I implemented a detector that uses the positive-going zero crossings in each band to estimate ITDs, but then realized that there's a problem with just using the raw times since the phase obviously wraps at ±180°. That also got me wondering what the hearing apparatus actually does with ITDs in excess of the 800µs or so you'd get in a free field with a low frequency source at ±90° azimuth. I did a few casual experiments (non-blind, etc.) myself with insert earphones, then went looking for some proper data. It seemed to me that the perceived azimuth gets much of the way to ±90° with an ITD of 800µs, then more slowly approaches maximum lateralization as the interaural phase goes to something around ±90° (i.e. 2.5ms ITD at 100Hz). This appears to agree with the few studies I was able to find. Interaural phase significantly beyond ±90° sounds less wide; I seem to get sort of a weird dual image effect.
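For concreteness, a stripped-down sketch of the detector (crossing times are interpolated between samples, and the raw differences are folded into plus/minus half a period at the band centre, which is exactly where the ambiguity comes from):

    import numpy as np

    def positive_zero_crossings(x, fs):
        # times (seconds) of upward-going zero crossings, linearly interpolated
        idx = np.where((x[:-1] < 0) & (x[1:] >= 0))[0]
        frac = -x[idx] / (x[idx + 1] - x[idx])
        return (idx + frac) / fs

    def band_itds(left, right, fs, fc):
        # per-crossing ITD estimates for one band, wrapped into +/- T/2 at fc
        tl = positive_zero_crossings(left, fs)
        tr = positive_zero_crossings(right, fs)
        if len(tl) == 0 or len(tr) == 0:
            return np.array([])
        period = 1.0 / fc
        itds = []
        for t in tl:
            d = t - tr[np.argmin(np.abs(tr - t))]    # nearest right-ear crossing
            d = (d + period / 2.0) % period - period / 2.0
            itds.append(d)
        return np.array(itds)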

I could presumably just unwrap the ITDs, but given the above I'm thinking that this might not actually be the best approach. Should fluctuations perhaps be weighted more strongly when the ITDs are within the ±800µs range?
 
I could presumably just unwrap the ITDs, but given the above I'm thinking that this might not actually be the best approach.

Your question is actually quite complicated. Yes, you should always unwrap the phase, but you can't unwrap the phase from DC because there is zero at DC. What happens when you exceed the maximum interaural difference by more than a bit is that the image moves into two places, more so with more delay, over time. This is something that has not been very well analyzed to my knowledge, but it is known to happen at other frequencies: below 800Hz or so by looking at the waveform, and above 2kHz or so by looking at the envelope rather than the waveform. (Between the two, somebody needs to do some basic research on the contributions of sine wave timing vs. envelope, but this question is quite confused by the issue of the "crossover" frequency being quite close to the wavelength corresponding to the maximum ITD, mixing with the firing rate and latency of the inner hair cell; in short, it's quite a mess.)

I don't know the answer specifically; while I know ways around it, I'm not sure I'm able to discuss that at present.
 
Your question is actually quite complicated.
I added a simple nonlinear ITD-to-azimuth mapping function which had an interesting (though not unexpected) effect on the reported metric. For low-passed stereo pink noise, the metric is only slightly reduced when applying a binaural HRTF approximation. In contrast, the older method of using weighted cross-correlation reports a dramatically reduced metric with the binaural HRTF.

Of course, how it should behave in this case is not entirely clear to me. I was initially thinking that this prediction would be easily testable, but I hadn't listened to uncorrelated low-passed noise over earphones before... In my limited tests, the simplified HRTF case is easily distinguishable from mono, but also sounds very different from uncorrelated noise (i.e. without the HRTF). The former sounds narrower but externalized to an extent, while the latter produces two images that are unnaturally wide-sounding and stuck on either side of my head. In other words, to me, the former has an enveloping quality while the latter doesn't really.

As I rather figured would be the case, accurately and robustly predicting perceived LF envelopment probably can't be done via a simple method. Auditory perception is complicated (but I certainly don't need to tell you that!).
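The mapping itself is nothing sophisticated; roughly this shape (an ad-hoc curve of my own, with the ~800µs knee and the slow approach to ±90° simply chosen to mimic the behaviour described earlier, not taken from any model):

    import numpy as np

    def itd_to_azimuth(itd_seconds, itd_free_field=800e-6):
        # Ad-hoc mapping: roughly linear for small ITDs, reaching ~68 degrees at
        # 800 us, then saturating slowly toward +/-90 degrees for larger values.
        return 90.0 * np.tanh(itd_seconds / itd_free_field)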
 
I was initially thinking that this prediction would be easily testable, but I hadn't listened to uncorrelated low-passed noise over earphones before...

One useful test signal is to create two stereo signals, one narrowband noise in the frequency of interest identical in both channels, the other uncorrelated in the two channels. Then you run various combinations of the two, like 100% mono, 100% independent, 50/50, 80/20 and so on. Again if there are hard results for this, I haven't seen them. I have used them to break codecs, and for some codecs, "break" is very, very accurate. :)
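Something along these lines (a sketch; the band edges and the power-based mixing are arbitrary choices, not a recipe):

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def narrowband_noise(n, fs, lo, hi, rng):
        sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        return sosfiltfilt(sos, rng.standard_normal(n))

    def test_pair(mono_fraction, n=480000, fs=48000, lo=60.0, hi=120.0, seed=0):
        # Stereo test signal: mono_fraction of identical narrowband noise plus
        # (1 - mono_fraction) of independent noise per channel, e.g. 1.0, 0.8, 0.5, 0.0.
        rng = np.random.default_rng(seed)
        m = narrowband_noise(n, fs, lo, hi, rng)      # identical in both channels
        ul = narrowband_noise(n, fs, lo, hi, rng)     # independent left
        ur = narrowband_noise(n, fs, lo, hi, rng)     # independent right
        a, b = np.sqrt(mono_fraction), np.sqrt(1.0 - mono_fraction)
        return np.stack([a * m + b * ul, a * m + b * ur], axis=-1)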
 
One useful test signal is to create two stereo signals, one narrowband noise in the frequency of interest identical in both channels, the other uncorrelated in the two channels. Then you run various combinations of the two, like 100% mono, 100% independent, 50/50, 80/20 and so on.
I tried something similar (varying mid/side levels, which should be equivalent) and found that if the cross-correlation is ~0.8, the metric reported using my ITD-to-azimuth scheme is about the same as the uncorrelated case. At ~0.9, it is reduced by about 10%.

Listening with earphones, a cross-correlation around 0.8-0.9 (bandpass centered at 80Hz) seems to result in "realistic" width. Zero correlation is again unnaturally wide.
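For anyone reproducing this: with independent, unit-variance mid and side noise, L = M + g*S and R = M - g*S give an interaural cross-correlation of (1 - g^2) / (1 + g^2), so the side gain needed for a target correlation is easy to compute (sketch):

    import numpy as np

    def side_gain_for_correlation(rho):
        # Side gain g (mid gain = 1) giving cross-correlation rho between
        # L = M + g*S and R = M - g*S, with M and S independent, unit-variance noise.
        return np.sqrt((1.0 - rho) / (1.0 + rho))

    # e.g. rho = 0.8 -> g ~ 0.33, rho = 0.9 -> g ~ 0.23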

for some codecs, "break" is very, very accurate
When you say "break", do you mean that they mangle the spatial quality?
 
When you say "break", do you mean that they mangle the spatial quality?

You could say that. The uncorrelated 's' noise images in a different place than the signal, unmasking a great ****load of noise. It's, um, 'striking'.
 
A video that I bumped into that I find interesting:


IMHO, the same goes for stereo bass, and this points to yet another circle of confusion. Either they don't monitor the low end properly and throw stuff away just in case, or they do care about it but we don't hear it because we think this very low end information is too "quiet" to be useful and worth investing in.

I think further investigation is needed into thresholds of perception: how quiet an uncorrelated signal can be, and how low in frequency, while still being perceived as enveloping. And all that in the midst of mono signals with sharp envelopes that are orders of magnitude louder. The case for it being that (as pointed out in the video) big, clean systems really don't need that much information down low in order to move a lot of air.
 
A video that I bumped into that I find interesting:
Ok about the bass, but what was really interesting was watching the RTA hit -15dB at 16kHz.

We already know the trend of using white noise in modern genres, but that's beyond what we normally see in the average song charts.
TC is probably fast enough to catch them? Or do they ditch them during mixing, so they end up in the -50s as we usually see?
 
Ok about the bass, but what was really interesting was watching the RTA hit -15dB at 16kHz.

Probably some rule of thumb considering tonality: if there's a lot of lows you also need more highs for the overall spectrum not to sound dull, and vice versa (if you cut the low end too soon, the highs may sound overly bright). I don't know, it's just my opinion. I've seen a lot of highs in transients that reach low, for example, but that's usually milliseconds.
 
In other words, to me, the former has an enveloping quality while the latter doesn't really.
Interesting.

There's this track that I find highly enveloping (in room):


I'm always amazed by the tools they have, and that they can create something just by turning a dial or two.

This particular track has all sorts of tricks in it and I wonder if VLF fluctuations are responsible for perception of envelopment. At least to me, there are some interesting patterns.

This is an excerpt from this track, L (blue), R (yellow):


Stereo.jpg


This is mono overlay in red:

Mono overlay.jpg


Just mono:

Mono.jpg


When summed to mono, not only are many of the transients lost...

Transient.jpg


But when listening on headphones, it's exactly as you describe: the image is unnaturally wide and stuck on either side of the head. And the transients sound rather thin.

Phase around the 35Hz fundamental, L/R/Mono:

Phase.jpg


Phase1.jpg


VLF fluctuations at 12, 7, 5, 4 and 3Hz:

VLF.jpg
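For anyone wanting to look for the same thing numerically rather than from analyzer screenshots (this is not what produced the plots above, just a rough sketch): take the short-term L/R level difference in the band of interest and look at its spectrum; peaks at a few Hz would correspond to the VLF fluctuations mentioned.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt, welch

    def vlf_fluctuation_spectrum(left, right, fs, lo=30.0, hi=120.0, frame=0.04):
        # Spectrum of the short-term L/R level difference in a low-frequency band.
        sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        l = sosfiltfilt(sos, left)
        r = sosfiltfilt(sos, right)
        hop = int(frame * fs)                         # 40 ms frames -> 25 Hz envelope rate
        n = (len(l) // hop) * hop
        rms = lambda x: np.sqrt(np.mean(x[:n].reshape(-1, hop) ** 2, axis=1) + 1e-20)
        diff_db = 20.0 * np.log10(rms(l) / rms(r))    # per-frame level difference in dB
        return welch(diff_db - diff_db.mean(), fs=1.0 / frame,
                     nperseg=min(256, len(diff_db)))  # (freqs, power) of the fluctuation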
 
There's this track that I find highly enveloping (in room):

Auditory Envelopment (AE) is LF fluctuation in time and level between the left ear and the right ear, a primary inducer of sensing space.

Uncorrelated LF noise in headphones leaves a supernatural feeling. However, it is a useful anchor, e.g. when developing terminology for an entire, overlooked dimension, or when designing elements of stereo or 3D distribution or reproduction.

The above is therefore a great discussion: what should actually be required from standard reproduction?

In stereo, with fine floorstanders, a constructive listening room helps to establish enjoyable AE, regardless of whether the L/R differences were recorded with microphones or produced like the Biome example. Such help is missing in headphones, which on the other hand provide access to unnatural and remarkable extremes, deprived of L/R crosstalk.

In 3D formats, without bass mismanagement, a wider and more dynamic AE dimension is available than with in-room stereo, unless the listening room dominates the source. 3D headphone reproduction is currently the Wild West in all sorts of ways, with a random virtual listening room, without reverence for AE, splashed on top.
 
Auditory Envelopment (AE) is LF fluctuation in time and level between the left ear and the right ear, a primary inducer of sensing space.
Thanks to @Thomas Lund's insightful and profound advice, I've come to enjoy stereo bass.
When listening to a recording that contains embedded AE information, switching to mono bass routing is like giving up before even trying to listen; his comment carries that nuance, and I find it incredibly clear and, in a way, self-evident.
 
Auditory Envelopment (AE) is LF fluctuation in time and level between the left ear and the right ear, a primary inducer of sensing space.
Evaluated through ERB-like filters. That is a very important consideration.
Uncorrelated LF noise in headphones leaves a supernatural feeling. However, it is a useful anchor, e.g. when developing terminology for an entire, overlooked dimension, or when designing elements of stereo or 3D distribution or reproduction.
This goes back to something I've said many times. There are an infinite number of ways to decorrelate channels. Most of them sound weird. Some sound good.

In order to solve that, you need to consider what happens in natural spaces.

(pretty much agreed)
In 3D formats, without bass mismanagement, a wider and more dynamic AE dimension is available than with in-room stereo, unless the listening room dominates the source. 3D headphone reproduction is currently the Wild West in all sorts of ways, with a random virtual listening room, without reverence for AE, splashed on top.

The key being "without bass management".

There are also things possible in headphones, but they are not simple, and headphone listening must, absolutely, consider head movement properly. Furthermore, it's necessary to allow the listener to explore around their position to learn the local soundfield. It's not just "wear it and it works". That's not how the higher parts of the human hearing system work.
 