
MQA creator Bob Stuart answers questions.

LTig

Master Contributor
Forum Donor
Joined
Feb 27, 2019
Messages
5,803
Likes
9,511
Location
Europe
[..]
Record again with a different time offset between the playback and recording clocks, and you get different samples from the ADC.

[Attachment 27341: two very different-looking sets of samples whose upsampled (reconstructed) waveforms coincide]


Note that the upsample creates the same-looking wave from the (very) different-looking samples. (Upsampling serves as my on-screen visual reconstruction filter for the already band-limited samples.)

Monty Montgomery would approve.
Very nice, and a perfect proof that the timing resolution is much finer than the sampling period.

The upsampled data actually show what the analog signal looks like after the reconstruction filter. Looking at the original data points (48 kHz), one must not draw lines between the points, because those lines would not represent the reconstructed analog signal: the series of data points also contains the alias frequencies (which are removed later by the reconstruction filter). When the frequency of the sampled signal is close to half the sample frequency, the sum of the sampled signal and its alias no longer looks like the sampled signal alone.

The upsampling process moves the alias signal much higher in frequency, hence the upsampled data points are a better representation of the original waveform.
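
For anyone who wants to reproduce this, here is a minimal sketch (my own, assuming numpy/scipy; not the script behind the attachment): sample the same band-limited tone with two clocks offset by half a sample, upsample both, and check that the reconstructions coincide apart from that half-sample time shift.

```python
# A minimal sketch (my own) of the same demonstration: two recordings of
# one band-limited tone, clocks offset by half a sample, reconstruct both.
import numpy as np
from scipy.signal import resample

fs = 48_000                       # sampling rate, Hz
f = 20_000                        # tone just below Nyquist
n = np.arange(240)                # 240 samples = exactly 100 cycles, so the
                                  # FFT-based resampler is exact

a = np.sin(2 * np.pi * f * n / fs)            # recording 1
b = np.sin(2 * np.pi * f * (n + 0.5) / fs)    # recording 2, clock offset 0.5 sample

up = 16                           # upsampling acts as the reconstruction filter
a_up = resample(a, len(a) * up)
b_up = resample(b, len(b) * up)

# b is the same analog tone sampled half a sample later, so after 16x
# upsampling it should equal a_up advanced by up // 2 = 8 fine-grained points.
err = np.max(np.abs(a_up - np.roll(b_up, up // 2)))
print(f"max difference after aligning the half-sample shift: {err:.1e}")
```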
 
Last edited:

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,473
Likes
4,090
Location
Pacific Northwest
Regarding whether human hearing is symmetric in the time & frequency domains: a pure 15 kHz tone at normal levels is now just beyond my hearing. I used to be able to hear beyond that when younger, and I have good hearing for my age, but am in my 50s. However, when playing music with strong HF content up to and beyond 20 kHz (such as a good recording of castanets, or keys being jangled in front of the mic), I can differentiate A from B in a double blind test when B is low pass filtered at 15 kHz. B sounds slightly less crisp or smeared in time.
One possible explanation is that human hearing is asymmetric in the frequency vs time domain. When it comes to very high frequencies near the limits of hearing, it's easier to hear their contribution to making transients crisper, than it is to hear them as pure tones. Of course it's also possible that I'm hearing something else, perhaps slight phase shift of a minimum phase digital filter.
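
For anyone wanting to replicate this kind of test, here is a minimal sketch (my own, not MRC01's actual procedure; the file name is made up) that prepares the low-passed "B" stimulus with a zero-phase filter, taking the minimum-phase confound mentioned above out of the comparison.

```python
# A minimal sketch (my own, not MRC01's procedure; file name is made up):
# low-pass the test recording at 15 kHz with a zero-phase linear FIR.
import numpy as np
from scipy.io import wavfile
from scipy.signal import firwin, filtfilt

fs, x = wavfile.read("castanets.wav")      # hypothetical source file
x = x.astype(np.float64)

taps = firwin(numtaps=511, cutoff=15_000, fs=fs)   # linear-phase low-pass FIR
y = filtfilt(taps, [1.0], x, axis=0)               # forward-backward = zero phase

y = np.clip(y, -32768, 32767)
wavfile.write("castanets_lp15k.wav", fs, y.astype(np.int16))
```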
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,680
Likes
37,389
Regarding whether human hearing is symmetric in the time & frequency domains: a pure 15 kHz tone at normal levels is now just beyond my hearing. I used to be able to hear beyond that when younger, and I have good hearing for my age, but am in my 50s. However, when playing music with strong HF content up to and beyond 20 kHz (such as a good recording of castanets, or keys being jangled in front of the mic), I can differentiate A from B in a double blind test when B is low pass filtered at 15 kHz. B sounds slightly less crisp or smeared in time.
One possible explanation is that human hearing is asymmetric in the frequency vs time domain. When it comes to very high frequencies near the limits of hearing, it's easier to hear their contribution to making transients crisper, than it is to hear them as pure tones. Of course it's also possible that I'm hearing something else, perhaps slight phase shift of a minimum phase digital filter.
Your threshold rises rapidly at the upper edge of your hearing. If you can just (or almost) hear 15 kHz at normal listening levels, it is possible that castanets will peak a good bit higher, meaning you still hear a little beyond 15 kHz at higher levels, just not at normal listening levels. So you would hear a difference in the castanets where you might not with a lower-level test tone. You would need to make sure the castanets peaked no higher than the level of whatever tone you tested yourself with.
 

Sergei

Senior Member
Forum Donor
Joined
Nov 20, 2018
Messages
361
Likes
272
Location
Palo Alto, CA, USA
You talk about samples every 5 µs (192 kHz). That would not capture what the rat endured for 0.0005 µs!

Not precisely. Yet pretty close under two conditions:

- The duration of the pulse is much shorter than the period of the characteristic frequency of the basilar membrane of the species. The characteristic frequency is usually close to the resonance frequency, yet can be strongly affected by other factors, such as the particulars of damping.

- The mechanical momentum transferred, proportional to the integral of the sound pressure over the duration of the pulse, is the same. So, given the four orders of magnitude of difference between the durations, we ought to lower the amplitude of the pulse by four orders of magnitude.

There are more subtleties concerning transfer of mechanical momentum vs transfer of energy: if the system from the tympanic membrane to the inner hair cell were ideally rigid, only momentum would be transferred. Momentum scales with pressure times duration, so stretching the pulse by four orders of magnitude lowers the amplitude by 80 dB, giving an equivalent SPL of 250 - 80 = 170 dB to reproduce the effect of that experiment in a rodent ear with the 5-microsecond pulse.

In reality, ear structures are not fully rigid: cells, membranes, and the middle-ear bone joints are elastic, so some of the transfer will happen in the form of energy. The energy transferred is proportional to the integral of the square of the sound pressure over the duration of the pulse, so the amplitude only needs to drop by 40 dB; for pure energy transfer, we would get an equivalent SPL of 250 - 40 = 210 dB.

The real-life equivalent SPL would be somewhere between 170 dB and 210 dB. Interestingly enough, the mid-point between those is 190 dB, in the vicinity of the maximum SPL that can be transferred through air by regular, non-shockwave sound waves: 194 dB. It is plausible to conclude that the mammalian hearing system evolved to endure, without irreversible damage, the maximum SPL it could encounter in nature.
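
The two scalings can be checked in a couple of lines (my arithmetic on the figures quoted above):

```python
# Checking the two scalings above (my arithmetic on the quoted figures):
# the pulse is stretched from 0.0005 us to 5 us, a factor of 1e4.
import math

stretch = 5 / 0.0005          # duration ratio, 1e4
spl_0 = 250                   # dB, equivalent SPL of the original pulse

# Equal momentum: p * T constant, so p scales as 1/T  -> 80 dB down.
print(spl_0 - 20 * math.log10(stretch))   # 170.0
# Equal energy: p^2 * T constant, so p scales as 1/sqrt(T)  -> 40 dB down.
print(spl_0 - 10 * math.log10(stretch))   # 210.0
```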

What's the relevance and why would you think it is audible? (you claimed this) Simply because the outer haircells were shot and thus must have produced 'a loud sound' ?
A single half sinewave of a 2.5MHz signal ?

Perhaps an analogy would help. When a conventional uranium-235 nuclear bomb explodes, the nuclear fission that generates the energy of the blast lasts only about a microsecond. Secondary physical processes transform the energy released by the fission into what eventually reaches the human ear as a loud sound - very loud if you are close enough.

Similarly, it is not the pulse itself that makes the sound, but what the basilar membrane and other structures of the ear "do" with the mechanical momentum and energy delivered by the pulse. It is a perfectly valid question to ask, on a qualitative level, what these structures do. The answers will be different depending on the characteristics of the pulse.

For instance, pulses with periods much longer than the period of the basilar membrane's characteristic frequency - let's say 1 Hz - are generally not heard, virtually regardless of amplitude, because the cochlea employs a mechanism which hydraulically cancels them out, so the inner hair cells don't react to them. Theoretically, there could be a mechanism in the cochlea that cancels out very short pulses as well.

As the experiment plainly shows, there is no such canceling mechanism that is sufficiently effective for short-duration pulses at the SPL applied. The cochlea still reacted, qualitatively, in a way characteristic of the one it uses for detecting sounds in the audible range.

Accounts of blast victims confirm that: subjectively, if a blast doesn't destroy the cochlea completely, the sensation is that of a very loud sound, followed by intense ringing in the ears (which also has a neurophysiological explanation).

What's it have to do with complex music signals ?

Some genres of music contain many transients. Transients are heard when listening to such music live. If transients are not captured and reproduced with sufficient accuracy, the music doesn't sound natural, as some of the temporal cues that allow the listener to separate perceptual "sound objects" are taken away.

Some (most?) commonly used recording formats don't capture the transients. Some (most?) commonly used sound systems don't reproduce them. That's a problem. MQA supposedly deals with part of this problem effectively, without falling back to uncompressed representations of transient-rich fragments of music the way competing lossy codecs do - a claim that still needs to be verified.
 

mansr

Major Contributor
Joined
Oct 5, 2018
Messages
4,685
Likes
10,703
Location
Hampshire
Some (most?) commonly used recording formats don't capture the transients. Some (most?) commonly used sound systems don't reproduce them. That's a problem. MQA supposedly deals with part of this problem effectively
Even if that were a real problem (it isn't), MQA couldn't possibly solve it. The output at the end of the MQA chain is regular PCM data sent to a regular DAC. Ergo, whatever shortcomings are inherent to linear PCM, they will be imposed at this final stage regardless of what voodoo precedes it.
 

Costia

Member
Joined
Jun 8, 2019
Messages
37
Likes
21
Some (most?) commonly used recording formats don't capture the transients. Some (most?) commonly used sound systems don't reproduce them.
If by transients you mean short pulses, then they will be captured, recorded and reproduced by 20 Hz-20 kHz equipment.
The pulse won't be "missed" if it happens "between" samples.
A short pulse, no matter how short it is, contains frequencies in the 20 Hz to 20 kHz range.
Those will be captured by 20 Hz-20 kHz recording equipment even if the pulse is 0.0005 µs long.
Then it can be stored in 16/44 PCM.
And it will be reproduced by anything that can reproduce the PCM.
The only "downside" is that it won't reproduce the ultrasonic components of the pulse, which you can't hear anyway.


Here is a pulse 1/384000 s long captured in 48 kHz PCM:
It's still there even though the pulse happened between samples.
It's not sharp anymore because the ultrasonics were dumped, but you can't hear those anyway.
I.e., your ear's response will be similar to the wavy graph either way.
[Image: upper graph - the original narrow pulse; lower graph - the same pulse captured in 48 kHz PCM, smeared but still present]
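
A sketch of how a plot like this can be produced (my guess at the procedure, assuming numpy/scipy; not Costia's actual script):

```python
# A one-sample pulse at 384 kHz, placed halfway between two 48 kHz sample
# instants, then anti-alias filtered and decimated into 48 kHz PCM.
import numpy as np
from scipy.signal import decimate

fs_hi, fs_lo = 384_000, 48_000
x = np.zeros(3840)                  # 10 ms at 384 kHz
x[1924] = 1.0                       # 1/384000 s pulse, off the 48 kHz grid
                                    # (1924 / 8 = 240.5, i.e. between samples)

y = decimate(x, fs_hi // fs_lo, ftype="fir")   # "capture" into 48 kHz PCM

# The pulse survives, smeared in time because the ultrasonics are gone.
print(np.max(np.abs(y)))
```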
 
Last edited:

nscrivener

Member
Joined
May 6, 2019
Messages
76
Likes
117
Location
Hamilton, New Zealand
Even if that were a real problem (it isn't), MQA couldn't possibly solve it. The output at the end of the MQA chain is regular PCM data sent to a regular DAC. Ergo, whatever shortcomings are inherent to linear PCM, they will be imposed at this final stage regardless of what voodoo precedes it.

I'm not sure that's necessarily true. Of course I agree the premise is probably faulty. Band-limiting signals does cause pre- and post-ringing of sharp transients when reconstructed. Whether this is audible is highly debatable, but it is measurable. And it is possible to "solve" this problem by a) increasing the bandwidth and b) changing to a less steep filter. The latter will allow more aliasing artifacts through. You could argue (probably incorrectly) that the improvement in measured transient response is a net improvement notwithstanding the aliasing. That's part of what MQA does. Of course this is achievable with standard PCM if that's what you want to achieve. MQA also purports to preserve some of the higher-frequency information applicable to those transients in its lossy encoding scheme.

So I think that it's possible that MQA solves the "problem" it is seeking to address. But as many have commented it is a "solution looking for a problem" rather than a "problem that needed solving".
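
To make the band-limiting trade-off above concrete, here is a rough sketch (generic textbook FIR designs, nothing MQA-specific) comparing how long a steep filter rings versus a relaxed one:

```python
# A steep near-brick-wall low-pass rings far longer than a short filter
# with a wide transition band (generic designs, nothing MQA-specific).
import numpy as np
from scipy.signal import firwin

fs = 44_100
steep = firwin(1023, cutoff=21_000, fs=fs)   # narrow transition band
slack = firwin(63, cutoff=18_000, fs=fs)     # wide transition band

def ring_ms(h, floor_db=-60):
    """Span over which the impulse response stays within floor_db of its peak."""
    env = 20 * np.log10(np.abs(h) / np.max(np.abs(h)) + 1e-12)
    idx = np.where(env > floor_db)[0]
    return (idx[-1] - idx[0]) / fs * 1e3

print(f"steep filter rings over ~{ring_ms(steep):.1f} ms")
print(f"slack filter rings over ~{ring_ms(slack):.1f} ms")
```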
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,473
Likes
4,090
Location
Pacific Northwest
...
The study does show that people can, in fact, distinguish high-res audio from standard resolution, and a number of the underlying studies report such a result even on real-world content (not just artificial test signals), which is fairly impressive. What the study also shows, however, is that people certainly can't reliably make this distinction. For untrained listeners, the lower bound on the confidence interval is less than a 50% success rate, which pretty much means untrained listeners can't tell the difference. Trained listeners fared better - a [57.5, 66.9] confidence interval with a mean of 62.2%. This still means that even trained listeners have a really hard time - it's a really subtle difference, and keep in mind the stimuli used were probably hand-picked for this task.
...
Yet it means the effect is real. 44-16 is transparent for most people, but not quite enough resolution to be completely transparent to all people. The question I find interesting and pragmatic is how much more we need in order to be transparent to all people. 44-16 is very close, so 192/24 and 96/24 are almost certainly overkill. I suspect that 48 or 64 kHz sampling would be sufficient to be completely transparent to all people, but I haven't seen studies exploring this.

... Band-limiting signals does cause pre- and post-ringing of sharp transients when reconstructed. Whether this is audible is highly debatable, but it is measurable. And it is possible to "solve" this problem by a) increasing the bandwidth and b) changing to a less steep filter. The latter will allow more aliasing artifacts through. ...
I agree. If you follow the reconstruction formula math, the ringing (pre/post echo) is at the Nyquist frequency. So 44.1 kHz sampling rings at a frequency most humans can't hear. If you use a slightly higher sampling frequency (say 48 or 64 kHz), that frequency will be inaudible to all humans. So both of your solutions (a) and (b) apply: use a higher sampling frequency, but start the filter transition band at the same frequency as before, giving a wider transition band and a more gradual slope. For example, sample at 64 kHz with a transition band from 22 kHz to 32 kHz.
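
A sketch of what that example filter could look like (my own parameter choices, assuming scipy):

```python
# 64 kHz sampling, flat to 22 kHz, then a gentle ramp down to zero at the
# 32 kHz Nyquist frequency (my own parameter choices).
import numpy as np
from scipy.signal import firwin2, freqz

fs = 64_000
taps = firwin2(numtaps=101,
               freq=[0, 22_000, 32_000],    # Hz; last point must be Nyquist
               gain=[1.0, 1.0, 0.0],        # flat passband, linear roll-off
               fs=fs)

w, h = freqz(taps, worN=8192, fs=fs)
for f_chk in (20_000, 27_000, 31_500):      # passband, mid-transition, near-stop
    mag = 20 * np.log10(np.abs(h[np.argmin(np.abs(w - f_chk))]) + 1e-12)
    print(f"{f_chk / 1000:5.1f} kHz: {mag:6.1f} dB")
```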

Despite evidence that some people can hear the limitations of 44-16, it's often the case that people who claim we need more than 44-16 use invalid arguments. Transient response and time resolution are a classic case where misunderstandings about how encoding and reconstruction work lead to specious arguments. Any transient that is so fast that it might be reconstructed differently based on the timing of samples is by definition above the Nyquist frequency and therefore cannot exist if the signal was properly encoded. That's called aliasing, and it is why the band-limiting filter applied during encoding is called an "anti-aliasing" filter.
 
Last edited:

nscrivener

Member
Joined
May 6, 2019
Messages
76
Likes
117
Location
Hamilton, New Zealand
One thing I have noticed listening to MQA masters on Tidal is occasionally there is a master that has a lower LUFS than the CD equivalent. Those tend to sound better for the same reason ordinary PCM would sound better at a lower average volume level. But interestingly this is not consistent and there are plenty of flat sounding, overly compressed MQA masters. So MQA still manages to perpetuate the single biggest culprit in reduced audio fidelity in digital formats: excessive average loudness levels.
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,680
Likes
37,389
Yet it means the effect is real. 44-16 is transparent for most people, but not quite enough resolution to be completely transparent to all people. The question I find interesting and pragmatic is how much more we need in order to be transparent to all people. 44-16 is very close, so 192/24 and 96/24 are almost certainly overkill. I suspect that 48 or 64 kHz sampling would be sufficient to be completely transparent to all people, but I haven't seen studies exploring this.


I agree. If you follow the reconstruction formula math, the ringing (pre/post echo) is at the Nyquist frequency. So 44.1 kHz sampling rings at a frequency most humans can't hear. If you use a slightly higher sampling frequency (say 48 or 64 kHz), that frequency will be inaudible to all humans. So both of your solutions (a) and (b) apply: use a higher sampling frequency, but start the filter transition band at the same frequency as before, giving a wider transition band and a more gradual slope. For example, sample at 64 kHz with a transition band from 22 kHz to 32 kHz.

Despite evidence that some people can hear the limitations of 44-16, it's often the case that people who claim we need more than 44-16 use invalid arguments. Transient response and time resolution are a classic case where misunderstandings about how encoding and reconstruction work lead to specious arguments. Any transient that is so fast that it might be reconstructed differently based on the timing of samples is by definition above the Nyquist frequency and therefore cannot exist if the signal was properly encoded. That's called aliasing, and it is why the band-limiting filter applied during encoding is called an "anti-aliasing" filter.

J_J, who posts here on occasion, suggested that 64 kHz sampling with filtering starting at 25 kHz would have given us good margins with no worries.
 

nscrivener

Member
Joined
May 6, 2019
Messages
76
Likes
117
Location
Hamilton, New Zealand
J_J, who posts here on occasion, suggested that 64 kHz sampling with filtering starting at 25 kHz would have given us good margins with no worries.

That sounds reasonable. I'd place money on the bet that: a) even if some trained listeners can discriminate higher resolution formats from standard CD quality audio in some controlled circumstances (some of the time), b) there would be no statistically significant success in discriminating between the above parameters and even higher resolution formats. Unless of course even higher resolution harms the signal in some way - for example by attempting to play back ultrasonic frequencies on equipment that suffers from intermodulation distortion as a result.
 

Sergei

Senior Member
Forum Donor
Joined
Nov 20, 2018
Messages
361
Likes
272
Location
Palo Alto, CA, USA
So it would result in aliasing?
Does the animal "hear" it as if it were a lower frequency tone, since it stimulates the same hair cells as a regular sound would?

At non-destructive intensities, most often it is perceived as a pitchless audio event. A good "litmus test", by the way: if you can ascribe a pitch to a "click" you are hearing, it is more likely that you are hearing audible frequency components of your sound system's impulse decay. The "proper" click is, how should I put it, "dry" - it has no discernible pitch.

Concretely, what exactly you hear depends on multiple factors: at the very least on pulse amplitude and shape, and on the state the individual hair cells are in at that moment - they could be "fresh and alert", not having "heard" anything during the previous ~2 ms, or "tired and recuperating", that is, masked at the moment the pulse arrives.

What I'm describing is not aliasing of a sinusoid of much higher frequency. Because there could be no sinusoid involved at all.
 

nscrivener

Member
Joined
May 6, 2019
Messages
76
Likes
117
Location
Hamilton, New Zealand
What I'm describing is not aliasing of a sinusoid of much higher frequency. Because there could be no sinusoid involved at all.
Aliasing has nothing to do with sine waves.
Nor does it have anything to do with the structure of the human ear.
This is basic sampling theory that even I understand from spending a couple of hours reading.
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,680
Likes
37,389
That sounds reasonable. I'd place money on the bet that: a) even if some trained listeners can discriminate higher resolution formats from standard CD quality audio in some controlled circumstances (some of the time), b) there would be no statistically significant success in discriminating between the above parameters and even higher resolution formats. Unless of course even higher resolution harms the signal in some way - for example by attempting to play back ultrasonic frequencies on equipment that suffers from intermodulation distortion as a result.

That is what I think too. Maybe I need to go read that meta-analysis again, but the best tests I've seen of rates above 48 kHz show a marginal chance of hearing more with higher rates, for young people with good hearing and on some types of music, and then just barely. Going to 88.2 or 96 kHz should fix all that.

MQA's results were only barely positive, and that was with large samples of choices, excellent gear, and listeners with special training. To imply it is a big, big quality improvement is beyond dubious. And this wasn't even with MQA; it was high rates vs low rates, with some idea that it relates to what MQA can do. Which raises the question: why has Mr. Stuart not done the big test, MQA vs 44/16? Write the AES paper on that. Oh, and don't handicap it with less-than-current dither or downsampling. Straight up MQA vs 44/16 over stupendously good systems with trained listeners. Show me the high rate of identification. I think the lack of such tests, along with the early demos that never let you compare MQA to direct Redbook, tells me all I need to know.

As a counterpoint, listen to recordings on reel-to-reel at 7.5 ips and at 15 ips. You have no trouble hearing the benefit. MQA doesn't provide this level of benefit. With 44.1 kHz vs 96 kHz, or vs 352 kHz, if there is a benefit it is tiny and difficult to determine by listening.
 

Sergei

Senior Member
Forum Donor
Joined
Nov 20, 2018
Messages
361
Likes
272
Location
Palo Alto, CA, USA
Regarding whether human hearing is symmetric in the time & frequency domains: a pure 15 kHz tone at normal levels is now just beyond my hearing. I used to be able to hear beyond that when younger, and I have good hearing for my age, but am in my 50s. However, when playing music with strong HF content up to and beyond 20 kHz (such as a good recording of castanets, or keys being jangled in front of the mic), I can differentiate A from B in a double blind test when B is low pass filtered at 15 kHz. B sounds slightly less crisp or smeared in time.
One possible explanation is that human hearing is asymmetric in the frequency vs time domain. When it comes to very high frequencies near the limits of hearing, it's easier to hear their contribution to making transients crisper, than it is to hear them as pure tones. Of course it's also possible that I'm hearing something else, perhaps slight phase shift of a minimum phase digital filter.

I conducted different experiments on myself, yet with similar results. That is one of the reasons I believe what I said.
 

RayDunzl

Grand Contributor
Central Scrutinizer
Joined
Mar 9, 2016
Messages
13,245
Likes
17,144
Location
Riverview FL

Sergei

Senior Member
Forum Donor
Joined
Nov 20, 2018
Messages
361
Likes
272
Location
Palo Alto, CA, USA
If by transients you mean short pulses, then they will be captured, recorded and reproduced by 20 Hz-20 kHz equipment.
The pulse won't be "missed" if it happens "between" samples.
A short pulse, no matter how short it is, contains frequencies in the 20 Hz to 20 kHz range.
Those will be captured by 20 Hz-20 kHz recording equipment even if the pulse is 0.0005 µs long.
Then it can be stored in 16/44 PCM.
And it will be reproduced by anything that can reproduce the PCM.
The only "downside" is that it won't reproduce the ultrasonic components of the pulse, which you can't hear anyway.


Here is a pulse 1/384000 s long captured in 48 kHz PCM:
It's still there even though the pulse happened between samples.
It's not sharp anymore because the ultrasonics were dumped, but you can't hear those anyway.
I.e., your ear's response will be similar to the wavy graph either way.
[Image: upper graph - the original narrow pulse; lower graph - the same pulse captured in 48 kHz PCM, smeared but still present]

Illustrates nicely what I'm talking about. Assuming that the perceptual energies corresponding to the upper and lower graphs are the same - this is not obvious to the naked eye on the logarithmic scale, but they must be the same, right?

We take the upper graph as describing a real-life signal, a component of live music. We take the lower graph (well, once it is upsampled to look more like an analog signal) as a representation of "CD sound".

Let's consider three situations:

(1) The energy delivered is 4x of the threshold energy required to elicit a sensation of sound.
(2) The energy delivered is 2x of that threshold energy.
(3) The energy delivered equals the threshold energy.

With the upper graph in play, the perceptual onset of the signal will be registered with the finest neurophysiological resolution, ~5 microseconds, in all three cases.

With the lower graph, the cases (1), (2), and (3) become distinct:

(1) The IHC (inner hair cell) will accumulate enough energy to spike at a point before the midpoint of the graph, creating an illusion of the sound arriving earlier.
(2) The sound will be perceived as arriving at the midpoint - that is, at the right time.
(3) The sensation will be triggered at the very end of the right tail of the lower graph, creating an illusion of the sound arriving later.

So, what we have here is that the delta between the time of arrival of the "live" pulse, as perceived by the hearing system, and the time of arrival of the "CD pulse", once again as perceived by the hearing system, depends on the pulse amplitude. Is it a benign auditory illusion? I don't think so.
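
As a purely illustrative toy model (mine, not Sergei's, with no claim to physiological accuracy), one can integrate pulse energy over time and watch the threshold-crossing point move with level:

```python
# Toy model: "narrow" stands in for the live pulse, "wide" for its 24 kHz
# band-limited "CD" version, normalized to equal total energy.
import numpy as np

fs = 384_000
t = np.arange(-200, 201) / fs                 # about +/-0.5 ms around the pulse

narrow = np.zeros_like(t)
narrow[len(t) // 2] = 1.0                     # "live" pulse at t = 0
wide = np.sinc(48_000 * t)                    # band-limited to 24 kHz
wide *= np.sqrt(np.sum(narrow**2) / np.sum(wide**2))   # equal energy

def onset(p, frac):
    """First index at which cumulative energy reaches frac of the total."""
    e = np.cumsum(p**2)
    return int(np.argmax(e >= frac * e[-1]))

# frac = 0.25, 0.5, 1.0 correspond to cases (1), (2), (3) above: pulse
# energy at 4x, 2x and 1x the detection threshold.
for frac in (0.25, 0.5, 1.0):
    shift = (onset(wide, frac) - onset(narrow, frac)) / fs * 1e6
    print(f"threshold at {frac:4.2f} of pulse energy: onset shift {shift:+7.1f} us")
```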

I believe this illusion partially contributes to the "loudness wars". If the amplitude of a frequency component is the same throughout a song on a CD, then the shift in the perceptual time of arrival will remain constant, preserving the time between auditory nerve spikes. This amounts to a phase shift, which can either be corrected via a FIR filter, or just left alone, in the hope that pop music doesn't contain that many closely spaced frequency components anyway - phase inconsistencies are generally considered benign if the frequencies in question are separated by at least one critical band.

If the amplitude of a frequency component delivered over the CD chain changes sufficiently over the characteristic time of the auditory system's short-term memory, then the perceived location of a sound source emitting this component may move, or blur (depending on inter-frequency masking and other factors). This illusion can be barely noticeable, noticeable yet entertaining, or noticeable and irritating, depending on the characteristics of the music and the listener.
 

Sergei

Senior Member
Forum Donor
Joined
Nov 20, 2018
Messages
361
Likes
272
Location
Palo Alto, CA, USA
Aliasing has nothing to do with sine waves.
Nor does it have anything to do with the structure of the human ear.
This is basic sampling theory that even I understand from spending a couple of hours reading.

Good start! Please keep reading. You may approach an understanding of what I am saying once you can explain the results of the following experiment, involving four sinusoids with the same amplitude but different durations. Give all the sinusoids a fade-in and a fade-out, let's say of 5 ms, to exclude the influence of transients.

(1) Listen to a segment of 1 kHz sinusoid, with a duration of 60 ms.

(2) Listen to a segment of 1 kHz sinusoid, with a duration of 120 ms.

(3) Listen to a segment of 1 kHz sinusoid, with a duration of 240 ms.

(4) Listen to a segment of 1 kHz sinusoid, with a duration of 480 ms.

--- Do (1) and (2) sound like they have the same loudness? Yes or no? Why?
--- How about (2) vs (3)?
--- How about (3) vs (4)?
--- Finally, how about (1) vs (4)? Can you explain what you hear?
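
Generating the four stimuli is straightforward; a minimal sketch (parameters as specified above; file names are my own) that lets anyone run the experiment - Sergei appears to be hinting at the temporal integration of loudness:

```python
# Four 1 kHz bursts of 60/120/240/480 ms with 5 ms raised-cosine fades.
import numpy as np
from scipy.io import wavfile

fs = 48_000
fade_n = int(0.005 * fs)                     # 5 ms fades
fade = 0.5 * (1 - np.cos(np.pi * np.arange(fade_n) / fade_n))

for ms in (60, 120, 240, 480):
    n = int(ms / 1000 * fs)
    x = np.sin(2 * np.pi * 1000 * np.arange(n) / fs)   # 1 kHz sinusoid
    x[:fade_n] *= fade                                 # fade-in
    x[-fade_n:] *= fade[::-1]                          # fade-out
    wavfile.write(f"tone_1khz_{ms}ms.wav", fs, (0.5 * x * 32767).astype(np.int16))
```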
 