Blind test - objectivists with tin hearing?

Blumlein 88 · Feb 9, 2019

Well beyond the formula, a good statistician would like to see 30 or so samples to feel good about filling out a distribution. They'll work with 15, but fewer than that is iffy, and suggestive at best.
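To put rough numbers on the sample-size point for the specific case of forced-choice ABX trials, here is a minimal sketch. It assumes independent trials and a 50% chance of a correct answer when guessing, i.e. a plain binomial model; that is an assumption for illustration, not necessarily the formula referred to above, and the function names are made up for this example.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """One-sided probability of scoring at least `correct` out of `trials`
    by pure guessing (assumed chance of a correct answer: 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials


def min_correct_for_significance(trials: int, alpha: float = 0.05):
    """Smallest score whose guessing probability is <= alpha.
    Returns None when even a perfect score cannot reach alpha."""
    for correct in range(trials + 1):
        if abx_p_value(correct, trials) <= alpha:
            return correct
    return None


if __name__ == "__main__":
    for n in (4, 10, 16, 30):
        need = min_correct_for_significance(n)
        print(n, "trials ->", need, "correct needed at alpha = 0.05")
```

Under these assumptions a 4-trial run can never reach the 5% level, while 10, 16 and 30 trials need 9, 12 and 20 correct respectively, which is one way to see why very short runs are only suggestive.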

Blumlein 88 · Feb 9, 2019

You also have results showing blind testing works reliably to some low thresholds. So you can't say it lacks useful reliability.

MRC01 · Feb 9, 2019

amirm said:
... You build up confidence towards one outcome and that is that. It may not be 100% but at some point, it is likely to be correct. Is this a flu or just cold? Is it chest pain or heart attack?
... It is possible that the difference exists but unlikely. They are liable to be better than most people so if they can't hear the difference, what chance does the average Joe have? Shades of gray.

I don't think we're far apart. I am a proponent of blind testing. It's an essential tool. I've participated in them and I've written ABX testing software. But every test has sensitivity limits, and it's important to differentiate precision from recall.

Here's why I think this is more than theoretical: we know that switching delays in ABX tests, even less than a second, reduce sensitivity. This proves there are real differences that are obscured by even a small time delay; even short-term memory is imperfect. Intellectual curiosity makes me wonder what other differences might be lost? Audio testing requires time delays and comparisons from memory, even with instantaneous switching, and we already know that even a fraction of a second obscures subtle differences.

I don't claim these differences definitely exist, but only what every engineer or scientist knows: that a negative test doesn't imply they don't exist. And, that known test sensitivity to time delays makes it at least plausible that they might exist. And I can't think of any way to detect or quantify false negatives. Purist objectivism requires a leap of faith, just like purist subjectivism. With DBT/ABX we have a very useful tool that has pushed the state of the art and industry forward, but we need to recognize its limitations lest it seduce us into engineering hubris.

I've read other DBT research but the ITU 1116 doc you referenced is new to me. I'm reading it, especially curious whether they found a way to compare audio with direct perception, without relying on memory, and whether they found a way to detect or quantify false negatives.

Cosmik · Feb 9, 2019

andreasmaaan said:
It's not even clear from that quote that they are performing these listening tests in a controlled manner.

I'm pretty sure they're not. They are saying that if they give it a quick listen and it sounds as expected, then it's OK. In the same way that any professional simply decides on stuff without conducting a survey or controlled blind experiment first. I should think that 99.9% of the man-made world is built that way!

SIY · Feb 9, 2019

Any time you ask yourself whether ABX (or whatever DBT format you choose; conflating the two is an error) somehow suppresses sensitivity, you have to consider how humans under those conditions are quite adept at distinguishing tiny level, EQ, and localization changes. But somehow, the tests are magically insensitive to that stuff that "everybody" hears but can't be captured with straightforward measurements.

It's downright spooky how that works!

amirm · Feb 9, 2019

MRC01 said:
Here's why I think this is more than theoretical: we know that switching delays in ABX tests, even less than a second, reduce sensitivity. This proves there are real differences that are obscured by even a small time delay; even short-term memory is imperfect. Intellectual curiosity makes me wonder what other differences might be lost?

What is lost, is lost to an expert listener. He is already so much better than others that to them, it makes no difference what that extra bit is. When I find differences between MP3 at 320 kbps and original, and it is hard, then I know no one is going to detect what I am hearing let alone what I am missing.

The applicability then becomes one of when theory predicts that something is audible and someone like me can't hear it. There, choice of content, switching time, etc. become valid concerns. I have lived through that many times. Usually if I focus and spend more time on it, I am able to break down the wall. A much younger version of me would do even better.

Krunok · Feb 9, 2019

amirm said:
When I find differences between MP3 at 320 kbps and original, and it is hard, then I know no one is going to detect what I am hearing let alone what I am missing.

Wow.. This is THE most modest statement I have encountered on this forum. Congrats!

JJB70 · Feb 9, 2019

I tend to think that if a double blind audio test requires training to discern differences that are inaudible to most, and even then it requires a lot of effort to identify differences then it kind of indicates that even where there are differences then they are not that significant. I've seen audiophiles break into a cold sweat with the effort of identifying subtle differences and then pretend there is a night and day difference, well no, if it was a night and day difference people wouldn't be so stressed out about whether or not they were identifying the different equipment.

I ripped all my CD collection to MP3, then years later repeated the exercise to FLAC. In many cases I cannot discern a difference, in others I can discern a difference but it takes effort to do so and is subtle, and if I'm listening to music to enjoy the music (perish the thought of it) then it makes no difference. What I do get from FLAC is a psychological boost in knowing that it is lossless and nominally "better" which has an effect in itself I think, and these days memory is cheap but if judging strictly by audio quality alone to be quite honest I'd be happy enough with 320k MP3.

Krunok · Feb 9, 2019

JJB70 said:
I tend to think that if a double blind audio test requires training to discern differences that are inaudible to most, and even then it requires a lot of effort to identify differences then it kind of indicates that even where there are differences then they are not that significant. I've seen audiophiles break into a cold sweat with the effort of identifying subtle differences and then pretend there is a night and day difference, well no, if it was a night and day difference people wouldn't be so stressed out about whether or not they were identifying the different equipment.

I ripped all my CD collection to MP3, then years later repeated the exercise to FLAC. In many cases I cannot discern a difference, in others I can discern a difference but it takes effort to do so and is subtle, and if I'm listening to music to enjoy the music (perish the thought of it) then it makes no difference. What I do get from FLAC is a psychological boost in knowing that it is lossless and nominally "better" which has an effect in itself I think, and these days memory is cheap but if judging strictly by audio quality alone to be quite honest I'd be happy enough with 320k MP3.

I fully agree.

sergeauckland · Feb 9, 2019

JJB70 said:
I tend to think that if a double blind audio test requires training to discern differences that are inaudible to most, and even then it requires a lot of effort to identify differences then it kind of indicates that even where there are differences then they are not that significant. I've seen audiophiles break into a cold sweat with the effort of identifying subtle differences and then pretend there is a night and day difference, well no, if it was a night and day difference people wouldn't be so stressed out about whether or not they were identifying the different equipment.

I ripped all my CD collection to MP3, then years later repeated the exercise to FLAC. In many cases I cannot discern a difference, in others I can discern a difference but it takes effort to do so and is subtle, and if I'm listening to music to enjoy the music (perish the thought of it) then it makes no difference. What I do get from FLAC is a psychological boost in knowing that it is lossless and nominally "better" which has an effect in itself I think, and these days memory is cheap but if judging strictly by audio quality alone to be quite honest I'd be happy enough with 320k MP3.

I've worked on this principle for a very long time. Differences that matter are pretty obvious. Those that take a lot of discerning don't.

I too started ripping in MP3, because that was before lossless was invented, and my player (Musicmatch) didn't support anything else at the time. Could I tell 320k MP3 from the CD? Could I hell! When ALAC became available in iTunes, I switched to that, as MP3 playback wasn't gapless, so I re-ripped the discs I had in MP3 and have carried on lossless ever since, although I have given up on iTunes as a ripper and have been using Exact Audio Copy and FLAC. I justify that to myself on the grounds that storage is cheap, so the difference between lossless and lossy is trivial in terms of storage costs, and it means I can always recover the original CD if I should ever need to.

For some reason, audiophiles seem to magnify the trivial, almost as if not hearing the difference between the inaudible is an affront to their masculinity.

S.

Sal1950 · Feb 9, 2019

JJB70 said:
I ripped all my CD collection to MP3, then years later repeated the exercise to FLAC. In many cases I cannot discern a difference, in others I can discern a difference but it takes effort to do so and is subtle, and if I'm listening to music to enjoy the music (perish the thought of it) then it makes no difference. What I do get from FLAC is a psychological boost in knowing that it is lossless and nominally "better" which has an effect in itself I think, and these days memory is cheap but if judging strictly by audio quality alone to be quite honest I'd be happy enough with 320k MP3.

Recently I got my invitation to the Qobuz US beta (192 kbps tier) and have been trying to compare the difference between it and my Spotify 320 stream. I'm finding myself in the same position as you. It's very hard to do in any case, since it's near impossible to find a way to level match and rapidly switch between the two. As it stands now I'm finding it hard to justify a monthly fee that is at least twice as expensive for even the CD tier. Besides that, and at least for the music I listen to, Spotify has the largest catalog around.

Kal Rubinson · Feb 9, 2019

MRC01 said:
Visually, we can perceive two different things simultaneously: hold them right next to each other and look at both.

It is more like a rapid A/B comparison as the viewer foveates each item alternately. In addition, it is probably the same with tactile information: with two objects palpated simultaneously in two hands, one probably cannot attend to both at once.

flipflop · Feb 9, 2019

Sal1950 said:
Recently I got my invitation to the Qobuz US beta (192 kbps tier) and have been trying to compare the difference between it and my Spotify 320 stream. I'm finding myself in the same position as you. It's very hard to do in any case, since it's near impossible to find a way to level match and rapidly switch between the two. As it stands now I'm finding it hard to justify a monthly fee that is at least twice as expensive for even the CD tier. Besides that, and at least for the music I listen to, Spotify has the largest catalog around.

If you want to compare 192 kbps mp3 to 320 kbps mp3, you can just use a lossless recording, transcode it to both formats and do a DBT with this: https://www.foobar2000.org/components/view/foo_abx
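The transcoding step suggested above could be scripted roughly like this. It is only a sketch: it assumes `ffmpeg` is installed and built with the `libmp3lame` encoder, and the helper names and output naming scheme are made up for illustration. The resulting files would then be loaded into the ABX component by hand.

```python
import subprocess
from pathlib import Path


def mp3_encode_command(source: str, bitrate_kbps: int) -> list[str]:
    """Build an ffmpeg command that encodes `source` to constant-bitrate MP3.
    The output lands next to the source, e.g. track.flac -> track_192k.mp3."""
    src = Path(source)
    out = src.with_name(f"{src.stem}_{bitrate_kbps}k.mp3")
    return [
        "ffmpeg", "-y",            # overwrite an existing output file
        "-i", source,              # lossless input (FLAC, WAV, ...)
        "-codec:a", "libmp3lame",  # assumes ffmpeg was built with LAME
        "-b:a", f"{bitrate_kbps}k",
        str(out),
    ]


def transcode_for_abx(source: str, bitrates=(192, 320)) -> None:
    """Encode the same lossless source once per bitrate (runs ffmpeg)."""
    for kbps in bitrates:
        subprocess.run(mp3_encode_command(source, kbps), check=True)


if __name__ == "__main__":
    # Only prints the commands; call transcode_for_abx() to actually encode.
    for kbps in (192, 320):
        print(" ".join(mp3_encode_command("track.flac", kbps)))
```

Encoding both files from the same lossless source keeps the mastering identical, so any audible difference comes from the codec settings rather than from the services' differing sources.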

MRC01 · Feb 9, 2019

amirm said:
I was with you till you said it tells you nothing.
The issues you state are true but we have tools to deal with them. ...

Having read that document last night, I see your point. Even though a negative result from a blind test is inconclusive (absence of evidence is not evidence of absence), you can derive something useful from it when testing a group of people. You can measure which listeners have greater acuity than others. And you can use that to filter your listeners and improve the test sensitivity.

However, you still cannot detect (let alone correct for) false negatives. And the evidence we do have suggests that these false negatives do exist. This evidence is that inserting even small (less than 1 second) switching delays reduces test sensitivity. So even short-term memory is imperfect, to a measurable extent. Yet when we perform a blind test, even with instantaneous switching we still rely on memory because we can't simultaneously hear A and B. We are always comparing one with our recent memory of the other.

The practical takeaways are:
We know that audio tests have limited sensitivity, because they rely on memory which is time-sensitive even for fast switching.
Whatever threshold of audibility we measure in blind tests, is not the threshold of inherent hearing acuity; it is the threshold of test sensitivity.
We can reasonably assume that inherent hearing acuity is an even lower threshold (to assume otherwise implies perfect short-term memory, which is implausible).
We don't know how much lower that threshold is because we can't detect the false negatives.

So:
Equipment makers using blind tests should add a safety factor to the minimum thresholds detected in tests. How much to add is up to their discretion.
People who express a preference between A and B, but can't differentiate them in a DBT might be victims of expectation bias or other psychological factors, or they might also be hearing a real difference that is lower than the test threshold but higher than the acuity threshold.
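One way to make the false-negative concern concrete is a power calculation. The sketch below assumes a listener with some fixed per-trial probability of answering correctly (`true_hit_rate`, a hypothetical quantity that cannot be observed directly) and independent trials. Under those assumptions it computes how often a run of a given length would fail to reach significance even though the listener really does hear something.

```python
from math import comb


def tail_prob(min_correct: int, trials: int, p: float) -> float:
    """P(at least `min_correct` correct in `trials`) for per-trial success p."""
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(min_correct, trials + 1)
    )


def critical_score(trials: int, alpha: float = 0.05):
    """Smallest score that a guesser (p = 0.5) reaches with probability <= alpha."""
    for c in range(trials + 1):
        if tail_prob(c, trials, 0.5) <= alpha:
            return c
    return None  # run too short to ever reach significance


def false_negative_rate(trials: int, true_hit_rate: float, alpha: float = 0.05) -> float:
    """Probability that a listener with the assumed `true_hit_rate`
    still fails to reach the critical score."""
    c = critical_score(trials, alpha)
    if c is None:
        return 1.0
    return 1.0 - tail_prob(c, trials, true_hit_rate)


if __name__ == "__main__":
    for n in (16, 50, 100):
        print(n, "trials, 60% hit rate -> miss probability",
              round(false_negative_rate(n, 0.60), 2))
```

This does not measure false negatives in a real test, since the true hit rate is exactly the unknown. It only shows how quickly a short run loses the ability to catch a listener who is right slightly more often than chance, and how adding trials narrows that gap.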

MRC01 · Feb 9, 2019

amirm said:
What is lost, is lost to an expert listener. He is already so much better than others that to them, it makes no difference what that extra bit is. ...

This seems to confuse test sensitivity with listener sensitivity. Imagine 2 people X and Y with equal hearing acuity. But X has better short-term memory. This is plausible since we know people vary widely in memory performance, both short term and long term. In this hypothetical scenario X and Y both hear the same differences, but X has a more accurate and detailed short-term memory to compare it with, thus out-performs Y on the DBT.
Also, since we know DBT are not perfect, like any test they have limited sensitivity, we can also plausibly assume that even X's inherent listening acuity is better than his test performance. His short-term memory, while better than Y, cannot be perfect.
Listening to music is an immersive immediate experience. X and Y both enjoy the full fidelity that their inherent hearing acuity affords, though they measure differently in DBT. (Sure, maybe X's superior short-term memory gives him slightly greater appreciation of the fidelity, but they're both hearing the same thing in the moment of perception).

Certainly, one can argue, as many have, that this difference must be small and insignificant. That might be true! But since we can't detect false negatives, we don't actually know.

MRC01 · Feb 9, 2019

Kal Rubinson said:
It is more like a rapid A/B comparison as the viewer foveates each item alternately. In addition, it is probably the same with tactile information: with two objects palpated simultaneously in two hands, one probably cannot attend to both at once.

Possibly! Though if so, the switching is more rapid than it can be with audio testing which requires absolute sequential separation. And we know through testing that even small time delays have a measurable impact on test sensitivity.

sergeauckland · Feb 9, 2019

MRC01 said:
Possibly! Though if so, the switching is more rapid than it can be with audio testing which requires absolute sequential separation. And we know through testing that even small time delays have a measurable impact on test sensitivity.

Yes, but switching can be instantaneous, to a few milliseconds, possibly faster, so one may not even perceive that switching has happened unless there's a change in sound quality. When doing AA AB BA BB testing, and recording whether each change is same/different, that sort of short switching time should not be any problem for anyone, however short their audio memory. They can switch back and forth as often as they like, as they only have to identify whether the two are same/different. Only once it has been established that a difference exists is there any point in going further with preferences or trying to identify what the differences are.

So often, differences clearly audible with sighted testing can't be found with blind testing. However, because AA AB BA BB testing doesn't rely on audio memory, unlike ABX, which does, volume matching has to be done to very close levels, as even a small level difference can be identified as 'different'.

S.
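The AA AB BA BB same/different protocol described above can be sketched as a randomized schedule plus a simple score. This is a minimal illustration with made-up function names; it assumes an equal number of each pair type and answers recorded as the strings "same" or "different".

```python
import random

PAIRS = ["AA", "AB", "BA", "BB"]


def make_schedule(repeats_per_pair: int, seed=None) -> list[str]:
    """Return a shuffled list with each pair type repeated equally often."""
    rng = random.Random(seed)
    schedule = PAIRS * repeats_per_pair
    rng.shuffle(schedule)
    return schedule


def score(schedule: list[str], answers: list[str]) -> dict:
    """Compare 'same'/'different' answers against the hidden schedule.
    hit_rate: share of truly different pairs called 'different'.
    false_alarm_rate: share of identical pairs called 'different'."""
    hits = false_alarms = n_diff = n_same = 0
    for pair, answer in zip(schedule, answers):
        really_different = pair[0] != pair[1]
        if really_different:
            n_diff += 1
            hits += answer == "different"
        else:
            n_same += 1
            false_alarms += answer == "different"
    return {"hit_rate": hits / n_diff, "false_alarm_rate": false_alarms / n_same}


if __name__ == "__main__":
    sched = make_schedule(5, seed=1)
    always_different = ["different"] * len(sched)
    print(score(sched, always_different))
```

Including the AA and BB pairs is what exposes a listener who simply answers "different" every time: the hit rate looks perfect, but so does the false alarm rate.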

MRC01 · Feb 9, 2019

Ah, but even instantaneous switching doesn't enable you to listen to A and B simultaneously. You're always listening to one and comparing what you hear to your recent memory of the other.

Put differently: you can hear A, and you can hear B, but you can't hear the difference between A and B. The difference is not something you directly perceive, but is created in your mind by comparing what you perceive to a memory of another perception.

One could say that switching time is so short as to be insignificant. But we know that you must listen to each for at least a second or two (probably longer) just to hear it properly, so in the comparison you're relying on audio memory of something several seconds old. Yet we know even a fraction of a second impairs audio memory to a measurable amount (as observed in the correlation of switch delay to test sensitivity).

Thus, we can plausibly assume actual hearing acuity thresholds are lower than test sensitivity thresholds. How much lower, we can't measure.

Blumlein 88 · Feb 9, 2019

MRC01 said:
Having read that document last night, I see your point. Even though a negative result from a blind test is inconclusive (absence of evidence is not evidence of absence), you can derive something useful from it when testing a group of people. You can measure which listeners have greater acuity than others. And you can use that to filter your listeners and improve the test sensitivity.

However, you still cannot detect (let alone correct for) false negatives. And the evidence we do have suggests that these false negatives do exist. This evidence is that inserting even small (less than 1 second) switching delays reduces test sensitivity. So even short-term memory is imperfect, to a measurable extent. Yet when we perform a blind test, even with instantaneous switching we still rely on memory because we can't simultaneously hear A and B. We are always comparing one with our recent memory of the other.

The practical takeaways are:
We know that audio tests have limited sensitivity, because they rely on memory which is time-sensitive even for fast switching.
Whatever threshold of audibility we measure in blind tests, is not the threshold of inherent hearing acuity; it is the threshold of test sensitivity.
We can reasonably assume that inherent hearing acuity is an even lower threshold (to assume otherwise implies perfect short-term memory, which is implausible).
We don't know how much lower that threshold is because we can't detect the false negatives.

So:
Equipment makers using blind tests should add a safety factor to the minimum thresholds detected in tests. How much to add is up to their discretion.
People who express a preference between A and B, but can't differentiate them in a DBT might be victims of expectation bias or other psychological factors, or they might also be hearing a real difference that is lower than the test threshold but higher than the acuity threshold.

And how would a listener have access to this additional acuity except in such testing? There are things detectable without a reference, but the threshold is higher. Smaller differences are detectable with a reference to switch to. The faster the switching, the better the results and the lower the threshold. You posit that, since it all relies on some memory, someone could be hearing something below what the testing can detect, which leads to a false negative. But under what circumstances could this be accessible to the perception of a listener?

This is one of the benefits of rapid switching in blind testing. When we've gotten this close to thresholds they are already well past what someone could pick up on in casual listening without a reference. Or polluted sighted long term listening. That is already a margin of safety vs the normal use of the audio equipment for listening to music.

Krunok · Feb 9, 2019

Blumlein 88 said:
.. When we've gotten this close to thresholds they are already well past what someone could pick up on in casual listening without a reference. Or polluted sighted long term listening. That is already a margin of safety vs the normal use of the audio equipment for listening to music.

Exactly.
