
For speech sounds, the inner ear sends additional info to the brain?

MrGoodbits

Member
Forum Donor
Joined
Sep 6, 2018
Messages
63
Likes
110
Location
Knoxville, Tennessee
Maybe the ear is a little more complicated than I thought. This article discusses new research on how the inner ear processes speech:

Discovery of inner ear function may improve diagnosis of hearing impairment

I always imagined the cochlea as essentially an FFT sensor, each hair cell measuring one frequency, and each cell having a direct nerve connection to somewhere in the brain. This research says that, for speech-like sounds, the inner ear sends additional info to the brain. This additional info concerns the "slowly varying envelope of the outermost pattern of speech sound". The inner ear decodes this envelope and sends the signal separately. Is this envelope signal going to the brain on a sort of side-band of nerves, separate from the direct nerves from the hair cells? Fascinating!
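
A minimal Octave/MATLAB sketch of the idea (a toy, not the paper's model): amplitude-modulate a high-frequency carrier, push it through a rectifying nonlinearity crudely standing in for hair-cell transduction, and low-pass the result; the slow envelope pops out at frequencies a nerve fiber could actually follow.

fs = 44100; t = (0:fs-1)'/fs;              % one second of signal
env = 0.5*(1 + sin(2*pi*80*t));            % slowly varying 80 Hz envelope
sig = env .* sin(2*pi*4000*t);             % 4 kHz carrier
rect = max(sig, 0);                        % half-wave rectifier
a = exp(-2*pi*200/fs);                     % one-pole low-pass at ~200 Hz
lp = filter(1-a, [1 -a], rect);
plot(t(1:2000), [sig(1:2000) lp(1:2000)]); % lp tracks the 80 Hz envelope (up to a scale factor)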
 

andreasmaaan

Master Contributor
Forum Donor
Joined
Jun 19, 2018
Messages
6,652
Likes
9,403
Very interesting, thanks :)

In answer to your question, there doesn't appear to be any sideband of nerves. Rather, the information is transmitted via the same hair cells to the same auditory nerves, the difference being that depolarisation of these hair cells is not caused by the movement of the basilar membrane, as had been assumed to always be the case.

From the original study:

"...the envelope is encoded mainly through electrical distortion generated by the hair cells, which allows information about the envelope of high-frequency signals to be transmitted to the brainstem despite the limited bandwidth of the auditory nerve. This is somewhat reminiscent of the demodulation to baseband signal processing used in telecommunications systems....

"Although the hearing organ has long been known to generate distortion products21, which are useful for diagnostic purposes36,37, they are generally viewed as by-products of sensory transduction and nonlinear basilar membrane motion38,39. Previous measurements of high-frequency distortions established that their amplitudes increased as the separation between the stimulus frequencies became smaller40. This is noteworthy, because many behaviorally relevant sounds are harmonic complexes with small separation between components41. Here we demonstrated that the amplitudes of all of the distortion components that we were able to record, whether high or low in frequency, depended on the shape of the envelope (Fig. 6a–d), an effect that has not previously been described but may be perceptually important."
 

Sergei

Senior Member
Forum Donor
Joined
Nov 20, 2018
Messages
361
Likes
272
Location
Palo Alto, CA, USA
That's some pretty sophisticated research! Thanks for sharing.

That partially explains why a human inner hair cell has up to 30 nerve fibers connected to it. Strictly speaking, only three are absolutely necessary to code the vibrations in the 0-10 dB, 10-30 dB, and 20-80 dB ranges. It was thought that the "excess" ones were used for redundancy, so that a subset of them could propagate spikes while other fibers were recharging.

Now it turns out that some of the "excess" fibers are in fact used for transmitting the envelope tracking data. This is indeed significant!
 

Cosmik

Major Contributor
Joined
Apr 24, 2016
Messages
3,075
Likes
2,180
Location
UK
Looks like I was right! :) (when talking about why an inverted backwave from an open baffle speaker is a bad idea)
Cosmik said:
... if you believe that hearing also works on the basis of correlation of the shape of one-off events in the time domain, an inversion may be significant in messing up that mechanism.

It's all about not second-guessing how hearing works in order to justify post hoc a speaker technology that has an arbitrary characteristic that you would never have designed into it deliberately. Just to save on the woodwork.

And also
Cosmik said:
This, to me, misses an important point - and is in keeping with the general claim that "music is just sine waves" (repeated by Ethan recently).

Music is transients. Transients are sine waves only if you window a portion of the signal and pretend that it repeats forever. A tuning fork or open guitar string ringing in response to a single frequency within a sound resembles how human hearing detects 'sine waves', effectively providing the brain with the time domain envelope of each frequency.

A spectrogram is a representation of this. In order to find the ITD - a far more robust measure of location than ILD - the brain doesn't need to compare phases of sine waves and simply give up at high frequencies: as part of its localisation strategy it merely needs to slide two spectrograms over each other and look for the highest correlation.

I suggest this is why, when you are listening to stereo over (good) speakers, you are not hearing the source wavering about with frequency content as your statements above would suggest. Instead, you are performing a time domain correlation on 'spectrograms', tracking each acoustic source as an integral object.

This is why stereo over speakers works so well - and why it is not just the loose, nebulous 'effect' that people (who have probably never heard it done properly?) suggest it is.

The main implication is that people are deluding themselves if they think they can look at a frequency domain representation of time domain events and conclude that they know how their speaker system will sound. There are an infinite number of ways to cobble together identical frequency responses once phase is disregarded and when smoothing is used to make it more 'understandable' visually. And all these different responses will be distorted in different ways in the time domain - the part of hearing that audiophiles generally disregard because the phase-free frequency domain makes everything look simple.

As the 'revelation' in the OP suggests, the only logical way to design an audio system is to make it straight in both frequency and timing responses, rather than trying to second guess how hearing works - or retro-fitting an idea of how hearing works to a convenient measurements system...

So every deviation you make from 'straight' takes you into the area where unknown aspects of hearing are signalling strange things to the brain. Timing misalignments between drivers, inverted & resonant envelopes from bass reflex ports, backwaves from open baffle speakers etc. There are many ways to get to the 'perfect' frequency response, but not many ways to get there while maintaining integrity in the time domain too.
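
For what it's worth, the "slide and correlate" idea quoted above can be reduced to a broadband toy in Octave/MATLAB (real binaural processing would run per frequency band; the 13-sample delay here is just an assumed ITD):

fs = 44100;
left = randn(4096,1);                   % broadband "transient"
itd_samples = 13;                       % ~0.3 ms, a plausible ITD at 44.1 kHz
right = [zeros(itd_samples,1); left(1:end-itd_samples)];
maxlag = 40;
c = zeros(2*maxlag+1,1);
for lag = -maxlag:maxlag                % brute-force cross-correlation
    if lag >= 0
        c(lag+maxlag+1) = left(1:end-lag)' * right(lag+1:end);
    else
        c(lag+maxlag+1) = left(-lag+1:end)' * right(1:end+lag);
    end
end
[~, k] = max(c);
estimated_itd = k - maxlag - 1          % recovers the 13-sample (~0.29 ms) delay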
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,785
Location
My kitchen or my listening room.
Well, I have to dispute this "unknown aspects of hearing" comment. The ear does a time/frequency analysis, where the analysis bandwidth and thence time response varies by at least a factor of 40 from 20 Hz to 20 kHz. At low frequencies (under 500 Hz) the ear phase-locks to actual signals. At high frequencies (above 2 kHz, more or less, with some arguments about nits) the ear responds to the signal envelope in the cochlear filter bandwidth. Between 500 Hz and 2 kHz or so, both mechanisms work to some extent, and some parts of time sensitivity actually get worse, both in a single ear and in binaural resolution.
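
That bandwidth variation can be sanity-checked with the Glasberg & Moore (1990) ERB approximation for auditory filter bandwidth (a standard published fit, not anything specific to this thread):

% ERB(f) = 24.7 * (4.37*f/1000 + 1), f in Hz, ERB in Hz
f = [20 100 500 1000 2000 5000 10000 20000];
erb = 24.7 * (4.37*f/1000 + 1);
ratio = erb(end)/erb(1)                 % ~81: well over a factor of 40
% Relative bandwidth ERB./f also varies: the ear is neither constant-Q
% nor constant-bandwidth, so no single FFT window size can match it.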

Now, not only does this happen, but leading edges (in signal at low frequencies) or envelope (at high frequencies) are also of utmost importance, because once they arrive, the outer hair cells begin to depolarize, and change the gain of the cochlea around that frequency. Again, hardly news. Look for "precedence effect". Look at the date. (cough)

The bit about multiple neurons with synapses to each inner hair cell is really kind of bogus the way it was described here; there is a lot going on, including the tuning of the firing of outer hair cells, etc., and the 0-10, 10-20 ... thing is just a great big nope.

Time relationships between different frequencies in a pitch synchronous fashion can help disambiguate a particular source of speech in a noisy environment.

I wouldn't say that the otoacoustic emissions are a surprise, since outer hair cells do change length a bit when they depolarize.

Now, something to realize is that cochlear filters are quite broad: they do not have a 'single frequency' response. Rather, each inner hair cell's response overlaps substantially with adjacent cells in terms of signal bandwidth, but of course the phase relationships change as you move along the basilar membrane.

I'm pretty sure that signal envelope has been discussed before, but perhaps not in this particular context.

As to "frequency response" that includes phase. Please.

Looking at a magnitude response can hide a multitude of sins. If you have access to Octave or MATLAB, take the FFT of some bit of music, say 2^20 samples long. Scramble the phase response. Do the ifft.

MATLAB would be something like

clc
clear all
close all

len=2^20;

y=wavread('your wave file name here'); % I am assuming 44/16 here, and stereo, note. Or use "audioread".

x=y(1:len,1:2); % throwing away the rest of the track. If it's too short you have to emend this.

x=fft(x);

for ii=2:(len/2) % start at bin 2 so DC keeps zero phase and the result stays real
t=2*pi*rand();
x(ii,1)=x(ii,1)*(cos(t)+1i*sin(t));
t=2*pi*rand();
x(ii,2)=x(ii,2)*(cos(t)+1i*sin(t));
end

x(len:-1:(len/2+2),1:2)=conj(x(2:(len/2),1:2)); % gotta be consistent in the negative frequency range.


x=real(ifft(x)); % real() just strips numerical residue from the imaginary part
%you need to check for overload here.

t=max(max(abs(x)));
x=x/t*.99;
wavwrite(x,44100,16,'weirdosound'); % or audiowrite('weirdosound.wav',x,44100) if you have a new MATLAB




Compare to the original.

Nobody with a clue argues that magnitude response alone is sufficient. So that's a straw man that's already been turned to ash.

and this board changes matlab range expressions into smiley faces. Bollix
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,785
Location
My kitchen or my listening room.
Oh, and an "FFT" is a terrible model of the cochlea. When you use an FFT you set yourself up to one time/frequency resolution.

T'aint like that.
 

Cosmik

Major Contributor
Joined
Apr 24, 2016
Messages
3,075
Likes
2,180
Location
UK
Oh, and an "FFT" is a terrible model of the cochlea. When you use an FFT you set yourself up to one time/frequency resolution.

T'aint like that.
Fine, my comments aren't meant for you.

But if you look through this forum you will find thread after thread about 'target curves', all based on FFT measurements. People poring over the 'in-room response' - without phase* - as though it tells them what they will hear. People who will do whatever it takes to get to that target curve. 'Room correction' algorithms (i.e. glorified graphic equalisers)? Aim a reverse phase tweeter at the ceiling? Take the back off the speaker? Saw a hole in the box? No problem, as long as it fills in that little dip just there. Phase? Time domain? Point source? Integrity of the envelope? Correspondence between direct and reverberant sound - and not just in terms of frequency response? Not an issue. "See, the smoothed frequency response measured with a FFT (without phase) is perfect."

* Because in-room phase looks like chaos. But it doesn't sound like it, because our ears are doing the complex time-frequency stuff you talk about and using it to separate the room from the direct. Etc. But try to tell the FFT people that.
 

andreasmaaan

Master Contributor
Forum Donor
Joined
Jun 19, 2018
Messages
6,652
Likes
9,403
Looks like I was right! :) (when talking about why an inverted backwave from an open baffle speaker is a bad idea)

By that logic, wouldn't a dipolar speaker be the only correct speaker for reproducing an open-backed drum (or a recording of any other dipolar acoustic event)?

...inverted & resonant envelopes from bass reflex ports...

The output from a bass reflex port isn't inverted (except below the tuning frequency). If it were, there would be a huge suck-out at and around the tuning frequency due to cancellation between the sound from the direct radiator and the port.

And the output from a bass reflex system is minimum phase, just like the output from a sealed box system. The group delay is greater at a given cutoff only because the slope of the HPF is steeper.

Once again I provide the modelled responses of a Visaton AL-200 in two 80L boxes, one ported (black trace) and one sealed (grey trace), EQ'd to have the same frequency response (I cheekily threw in diaphragm displacement too):

[Attached graph: modelled response of the ported (black) and sealed (grey) alignments, including diaphragm displacement]
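
The group-delay point can also be sketched with textbook Butterworth high-passes standing in for the sealed (2nd-order) and ported (4th-order) roll-offs; real box alignments differ from pure Butterworth, but the steeper-slope-means-more-group-delay behaviour survives the simplification:

fc = 40; w0 = 2*pi*fc;                                     % both "boxes" tuned to 40 Hz
f = logspace(log10(10), log10(400), 2048); s = 1i*2*pi*f;
H2 = s.^2 ./ (s.^2 + sqrt(2)*w0*s + w0^2);                 % 2nd-order Butterworth HP
q = [0.5412 1.3066];                                       % 4th-order Butterworth section Qs
H4 = (s.^2 ./ (s.^2 + w0/q(1)*s + w0^2)) .* ...
     (s.^2 ./ (s.^2 + w0/q(2)*s + w0^2));
gd = @(H) -diff(unwrap(angle(H))) ./ diff(2*pi*f);         % numerical group delay
semilogx(f(2:end), 1000*gd(H2), f(2:end), 1000*gd(H4));
xlabel('Hz'); ylabel('group delay (ms)'); legend('2nd order','4th order');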


No need to rehash this one again of course ;)

I'm interested in your response to the question about dipoles though...
 

andreasmaaan

Master Contributor
Forum Donor
Joined
Jun 19, 2018
Messages
6,652
Likes
9,403
As to "frequency response" that includes phase. Please.

Didn't quite catch your drift here @j_j. Are you talking about non-minimum phase systems? And are you talking about how things are or how they should be?
 

Cosmik

Major Contributor
Joined
Apr 24, 2016
Messages
3,075
Likes
2,180
Location
UK
By that logic, wouldn't a dipolar speaker be the only correct speaker for reproducing an open-backed drum (or a recording of any other dipolar acoustic event)?
Clearly we need an object-based system where only dipolar instruments are routed to reverse phase drivers on the back of the speaker (I claim the patent :)).

But seriously... the recording contains the results of the original venue's reflections of the dipole source's inverted wavefront already. The recording system isn't generally intended to recreate the individual instruments 'dry' like a fairground organ (in which case that special arrangement might be the way to do it).

All it can do in the general case is reproduce the composite recording as straight as possible. The listening room adds a smattering of ambience that our brains are tuned to 'hear through' (although that's not quite so easy to claim for stereo), but if we invert a portion of the signal and spray it at a wall, we create reflections of something that is simply not in the recording. Only by telling ourselves "the frequency response/phase plot for reflections looks so complex that we can regard them as 'de-correlated', just ambience stuff" can we pretend that flipping a switch between positive and negative wiring for speakers mounted behind our main speakers wouldn't sound any different. That could be a good demo of why open baffle speakers give 'out of this world' sound stage.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,785
Location
My kitchen or my listening room.
The main implication is that people are deluding themselves if they think they can look at a frequency domain representation of time domain events and conclude that they know how their speaker system will sound....

The mistake is when people (not necessarily you) seem to think that "frequency domain" ignores phase.
 

Sergei

Senior Member
Forum Donor
Joined
Nov 20, 2018
Messages
361
Likes
272
Location
Palo Alto, CA, USA
The bit about multiple neurons with synapses to each inner hair cell is really kind of bogus the way it was described here; there is a lot going on, including the tuning of the firing of outer hair cells, etc., and the 0-10, 10-20 ... thing is just a great big nope.

Sounds like you caught me on oversimplification. Guilty as charged :)

What I meant is that, according to the prevalent current understanding, there are three types of auditory nerve fibers: High|Medium|Low Spontaneous discharge Rate (HSR|MSR|LSR). Indeed, the mappings of dB HL to Spiking Rate (SR) in those differ across species, individuals, center tuning frequency, OHC condition, adaptation via efferent fibers, etc. However, the majority of such mappings do appear to share the following characteristics:
  • The HSR useful dynamic range is narrow, on the order of 10 dB, starting at the hearing threshold, and the HSR HL->SR mapping function is very steep.
  • The MSR useful dynamic range overlaps with the HSR's, is shifted up the dB axis, is as narrow as or a bit wider than the HSR's, with a less steep mapping function.
  • The LSR useful dynamic range overlaps with the MSR's, is shifted further up the dB axis, and is significantly wider than the MSR's, with a shallow mapping function.
Kind of what's shown here: https://www.sciencedirect.com/science/article/pii/S0378595517303477#fig5 (and sketched in toy form below).
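
A toy Octave/MATLAB sketch of those staggered mappings (shapes and numbers are illustrative, not fitted to any data):

dB = 0:0.5:100;                         % sound level re. threshold
rate = @(thr, range, peak) peak ./ (1 + exp(-6*(dB - thr - range/2)/range));
hsr = rate( 0, 10, 250);                % high-SR: low threshold, steep, narrow
msr = rate( 8, 25, 150);                % medium-SR: shifted up, less steep
lsr = rate(20, 60,  80);                % low-SR: wide range, shallow slope
plot(dB, hsr, dB, msr, dB, lsr);
xlabel('dB HL (illustrative)'); ylabel('spikes/s'); legend('HSR','MSR','LSR');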

Evolutionarily, it makes perfect sense: the effective resolution of a bio-ADC connected to a fiber doesn't usually exceed 8 bits, due to limitations of the "serial link", so it makes sense to use three ADC units, with two of them digitizing attenuated input signals, to cover with sufficient resolution the range of sound levels needed for survival.

It wasn't completely clear, let's say ten years ago, how exactly the tracking and encoding of the envelope happen. Phenomenologically, researchers understood that there ought to be some kind of rectifier there, followed by a low-pass filter. Yet what all those diodes, resistors, and capacitors are, and how they are wired together, was a mystery. (I'm oversimplifying again :) )

There were theories building the required behavior from the mechanical and electrical properties of the basilar membrane and OHC. There were theories putting these elements into IHC soma - after all, neurons are known for implementing quite sophisticated functions, as they may contain internal analogs of diodes, resistors, and capacitors. There were theories, and rather convincing ones, postulating that most of the observed properties emerge from stochastic patterns of discharges in the "excess" fibers, cross-correlated by a cluster of neurons somewhere deeper in the brain.

Now, the article we are discussing claims, rather convincingly, that the actual mechanism is split between the mechanics of the IHC stereocilia and the electronics of the IHC soma. Moreover, it claims that the observed sophisticated yet stable HL->ADC-input mapping, as well as the parameters of envelope tracking, are mostly determined by the geometrical and mechanical properties of the stereocilia.

Why is that significant? Well, its significance to different people differs. For me, it finally resolves a couple of auditory system mysteries important for what I do. If you can't think of any, then I guess it is not all that important to you.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,785
Location
My kitchen or my listening room.
I wouldn't say it was unimportant, but from an external phenomenon point of view, the envelope effects were quite apparent in the analog to BLMD that shows up at higher frequencies.

Given the firing rate of a neuron, indeed it's pretty amazing we get anything over 30 dB from an inner hair cell. The "conversion" of the basic firing into more sophisticated information happens in a few places, I think, from the organ of Corti on into the CNS. I stay completely out of the CNS, myself.

The part I find interesting is how the detuning/retuning of depolarized outer hair cells can give us the 90dB or so dynamic range we enjoy when we're young. I think some of that envelope detection, etc, comes about due to the interaction between inner hair cells and the detuning from outer hair cells when depolarized. (I'm with Zwislocki for basic mechanism, I think stiffness vs. viscosity is also related, and yeah, that's more complicated than I want to model.) It is clear that there are several aspects, those being "leading edge", "constant envelope" and "envelope attack".

I'm not going to speculate on how all of that works. I will say that leading edge and envelope attack emphasis must partially come from the loss of sensitivity due to outer hair cell depolarization.
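
A toy automatic-gain-control sketch of that onset-emphasis idea (purely illustrative, not a cochlear model): an envelope detector slowly turns the gain down after a burst arrives, so the leading edge comes through emphasized relative to the sustained portion.

fs = 48000;
burst = sin(2*pi*3000*(0:fs/4-1)'/fs);       % 250 ms tone burst at 3 kHz
x = [zeros(fs/8,1); burst; zeros(fs/8,1)];   % silence, burst, silence
g = 1; env = 0; y = zeros(size(x));
a_env  = exp(-2*pi*300/fs);                  % fast envelope tracker
a_gain = exp(-1/(0.030*fs));                 % ~30 ms gain relaxation
for n = 1:length(x)
    env = a_env*env + (1 - a_env)*abs(x(n)); % track the signal envelope
    target = 1/(1 + 4*env);                  % more envelope -> less gain
    g = a_gain*g + (1 - a_gain)*target;      % gain moves slowly toward target
    y(n) = g*x(n);
end
plot((1:length(x))'/fs, [x y]);              % y shows the onset overshoot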

I would never argue that's the only source, I don't know. I can MODEL it that way and get good predictive results (not from a physics model like I refer to above, though!), but that is emphatically not anything like a proof, and I'm pretty sure that the physics imply rather more complexity.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,785
Location
My kitchen or my listening room.
Oh, and I should say "additional information" is a bad way to put this. It's not additional, it's part of the information. Since we detect this, it's hardly additional! :)
 

Cosmik

Major Contributor
Joined
Apr 24, 2016
Messages
3,075
Likes
2,180
Location
UK
It wasn't completely clear, let's say ten years ago, how exactly the tracking and encoding of the envelope happen. Phenomenologically, researchers understood that there ought to be some kind of rectifier there, followed by a low-pass filter. Yet what all those diodes, resistors, and capacitors are, and how they are wired together, was a mystery.
Is it very fruitful to look for physical 'devices' (or analogies thereof) as evidence of how it works?

A neural network can be trained to provide any function you want (and can also be 'pre-evolved' to a good point to start the training from). The brain, it is thought, is built from neural network structures. You will never find out what is contained in those structures by dissecting them, but they can certainly be performing the functions of capacitors, resistors and diodes, and DSP as complex as you like.

We need to think at 'system' level rather than just hardware or software.
 

Sergei

Senior Member
Forum Donor
Joined
Nov 20, 2018
Messages
361
Likes
272
Location
Palo Alto, CA, USA
Is it very fruitful to look for physical 'devices' (or analogies thereof) as evidence of how it works?

It depends on what your goals are. A simple analogy. We usually don't need to care much about how a new leased car is built, we only care about its externally manifested attributes. However, if you decide to buy or to keep a used car, there is huge value in understanding what its components are, how they are connected, what their dominant failure modes are, and how to fix the car when needed.

Correspondingly, there are branches of the audio sciences mostly interested in the externally manifested attributes of the human hearing system, and there is a set of branches dealing with "used ears". In my experience, and the experience of my relatives and friends: as we age, the latter becomes more and more relevant to us personally.
 

Cosmik

Major Contributor
Joined
Apr 24, 2016
Messages
3,075
Likes
2,180
Location
UK
It depends on what your goals are. A simple analogy. We usually don't need to care much about how a new leased car is built, we only care about its externally manifested attributes. However, if you decide to buy or to keep a used car, there is huge value in understanding what its components are, how they are connected, what their dominant failure modes are, and how to fix the car when needed.

Correspondingly, there are branches of the audio sciences mostly interested in the externally manifested attributes of the human hearing system, and there is a set of branches dealing with "used ears". In my experience, and the experience of my relatives and friends: as we age, the latter becomes more and more relevant to us personally.
What I mean is that an absence of 'hardware' wouldn't mean that functions such as envelope detection weren't being performed - in 'software'. I don't mean that there wouldn't be any incentive to find the hardware if it was there.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,785
Location
My kitchen or my listening room.
Correspondingly, there are branches of the audio sciences mostly interested in the externally manifested attributes of the human hearing system, and there is a set of branches dealing with "used ears". In my experience, and the experience of my relatives and friends: as we age, the latter becomes more and more relevant to us personally.

Understanding how the basic "hardware" works is always important in my book, even if I use the phenomena rather than the basic mechanisms, because it exposes the why and how of the phenomena, and allows for improvement.
 