
Blind test - objectivists with tin hearing?

Don Hills · Addicted to Fun and Learning · Wellington, New Zealand

sergeauckland · Major Contributor, Forum Donor · Suffolk UK
And as true today as it was 30 years ago. In some respects even more so given the increase in Foo products since then.

 

Don Hills · Addicted to Fun and Learning · Wellington, New Zealand
Worse, he was bemoaning a state of affairs that began 20 years before he wrote it.
 

Phronesis · Member · Maryland, USA
I concur with the comment that expectations of there being no difference will influence how someone approaches a listening test, and blinding won't solve that problem.

If someone expects A and B to sound the same, they'll tend to perceive them as sounding the same, and if they suspect they might be hearing differences, they'll tend to conclude that there aren't really any differences. When we make these comparisons, due to the fallibility of perception and memory, there's usually some uncertainty about whether there are differences (especially when actual differences are small), so we can tip towards perceiving or not perceiving differences based on our expectations.

Another aspect is that, even if differences audible to a given listener exist, whether the listener detects them will be influenced by the specific differences they expect to hear (e.g., if someone expects to hear a difference in level of detail, they may miss a difference in the amount of bass).
 

SIY · Grand Contributor, Technical Expert · Alfred, NY
Let me ask an obvious question: if I want to determine whether A and B sound different, why in the world would I use as a test subject for a double-blind test someone who doesn't hear a difference between them sighted? Likewise, if I don't hear a difference between A and B, why would I run a double-blind test on myself? It makes no sense.
 

MRC01 · Major Contributor · Pacific Northwest
... The point of a blind test is to measure the impact of sound alone. As with the argument above where the objectivist wanted to show there was no audible difference, telling objectivists "here comes a cable test" may shut them down and make them want to vote with their opinion, not with what they could have heard.
At the risk of resurrecting a stale discussion: in thinking about blind or double-blind testing, one of the things I believe contributes to the debate, and sometimes to confusion, is that these tests are asymmetric with respect to precision and recall. In these tests, precision is known and controllable but recall is unknown. Put differently: you can control and measure the rate of false positives, but you can't measure, let alone control, false negatives.

You can reduce the rate of false positives by controlling for known factors like level matching, and by doing more trials. The more trials, the higher the precision. Of course the number of trials is limited by listener fatigue, but 95% confidence only requires 5 straight, or 7 of 8, and most listeners can go at least that long before fatigue sets in. So in practice you can get better than 95% confidence of having no false positives.

Problem is, you never know about false negatives. They can happen for any number of factors: switching delay, listener fatigue, stress, appropriateness of musical selection, whatever. But there's no way to measure them, let alone control them. The crux is that we are biologically incapable of properly discerning by listening to both A and B simultaneously, so we must switch back and forth, which is inherently imperfect. It's possible (I would say not just possible, but plausible) that there are differences we can hear but are too subtle to pass through this filter.

So blind testing is like a one-way door. Positive results tell you something definitive, within the confidence range that you can control and measure. But negative results tell you nothing; more precisely, only that whatever difference you were testing for was not obvious enough to be detected. It doesn't mean the difference doesn't exist, nor does it mean you can't hear it. A negative test result simply doesn't tell you one way or the other.

In short, I think A/B/X testing is an important tool that reveals valuable insights. Certainly, participating in them (and training for them) over the years has honed my perception while at the same time making me a bit more humble, and taught me to be a more careful audiophile. Yet this asymmetry of its results is an important limitation to understand.
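To make that one-way-door asymmetry concrete, here is a minimal simulation sketch in Python. The detection rates and the abx_run helper are invented for illustration; this is just the binomial logic described above, not any standard protocol.

```python
# Minimal sketch: a listener who genuinely hears a difference on some
# fraction of trials can still fail an 8-trial ABX run -- and the run
# itself cannot distinguish that false negative from "no difference".
# The detection rates are invented for illustration.
import random

def abx_run(p_detect: float, trials: int = 8, needed: int = 7) -> bool:
    """One ABX run: each trial is either detected (correct) or a 50/50 guess."""
    correct = sum(1 if random.random() < p_detect else random.random() < 0.5
                  for _ in range(trials))
    return correct >= needed  # True = positive result at ~95% confidence

runs = 10_000
for p_detect in (0.0, 0.3, 0.6, 0.9):
    rate = sum(abx_run(p_detect) for _ in range(runs)) / runs
    print(f"hears it on {p_detect:.0%} of trials -> passes {rate:.1%} of runs")
```

With these made-up numbers, a listener who detects the difference on 60% of trials passes only about half of the runs: a real, audible effect that the test routinely reports as a negative.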
 

MRC01 · Major Contributor · Pacific Northwest
PS: in fewer words
My first reaction is to say DBT is "high precision, low recall". But that's not quite correct. It's more accurate to say it's "Controlled precision, unknown recall."
 

jsrtheta · Addicted to Fun and Learning · Colorado
At the risk of resurrecting a stale discussion: in thinking about blind or double-blind testing, one of the things I believe contributes to the debate, and sometimes to confusion, is that these tests are asymmetric with respect to precision and recall. In these tests, precision is known and controllable but recall is unknown. Put differently: you can control and measure the rate of false positives, but you can't measure, let alone control, false negatives.

95% confidence only requires 5 straight, or 7 of 8, and most listeners can go at least that long before fatigue sets in. So in practice you can get better than 95% confidence of having no false positives.

Problem is, you never know about false negatives. They can happen for any number of factors: switching delay, listener fatigue, stress, appropriateness of musical selection, whatever.

It's possible (I would say not just possible, but plausible) that there are differences we can hear but are too subtle to pass through this filter

You sure have a lot of excuses here. If ABX is employed properly, there is no switching delay. Nor are all the trials required to be performed in the same session, so I do not see how there is any "listener fatigue". And the minimum number of tests is 10, for which 9 correct gives a score of 95% confidence, not "5 straight, or 7 of 8". And the listener can pick his own musical material for use in the test.

And the "differences we can hear but are too subtle to pass through this filter" trope is just another way of saying "I can hear differences, but they are too subtle to hear."
 

MRC01 · Major Contributor · Pacific Northwest
You sure have a lot of excuses here. If ABX is employed properly, there is no switching delay. Nor are all the trials required to be performed in the same session, so I do not see how there is any "listener fatigue". And the minimum number of tests is 10, for which 9 correct gives a score of 95% confidence, not "5 straight, or 7 of 8". And the listener can pick his own musical material for use in the test.
...
It's not about excuses. I'm a proponent of ABX. But it's important to understand its limitations. I can think of only 2 objections to this:
1. Claim that false negatives cannot happen.
2. Claim that ABX tests can detect or control for false negatives.
Neither of these seems plausible. Can you imagine other objections?

I believe you have the math incorrect. Here's where I got those numbers: check my math. If you have X trials, each with a 50% chance of being guessed correctly, the chance of getting at least Y of them correct by guessing is:
sum over k = Y..X of (X choose k) * (1/2)^X
That means
5 in a row is 96.9% confidence (3.1% chance to get them right by guessing)
7 of 7 is 99.2% (0.8% chance to get them right by guessing)
7 of 8 is 96.5% (3.5% chance to get them right by guessing)
Your example of 9 of 10 would be 98.9% confidence.
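For anyone who wants to check those numbers, here is a minimal Python sketch of the same binomial tail; the function name is mine, and nothing audio-specific is assumed.

```python
# Exact binomial tail: the chance of at least `correct` right answers
# out of `trials` fair-coin guesses.
from math import comb

def guess_p(trials: int, correct: int) -> float:
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

for trials, correct in [(5, 5), (7, 7), (8, 7), (10, 9)]:
    p = guess_p(trials, correct)
    print(f"{correct} of {trials}: {p:.1%} by guessing, {1 - p:.1%} confidence")
```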
 


SIY · Grand Contributor, Technical Expert · Alfred, NY
Why do you restrict double blind to the ABX format? There are lots of other formats, the controlling factor being “double blind.” One chooses the appropriate format for the question being studied.

I’ve never seen anyone who actually does sensory testing ever claim that false negatives are impossible. But when a proposition is tested over and over by lots of different people under lots of different conditions and the results are always negative, it’s a pretty safe presumption that the negative result is correct.
 

jsrtheta · Addicted to Fun and Learning · Colorado
It's not about excuses. I'm a proponent of ABX. But it's important to understand its limitations. I can think of only 2 objections to this:
1. Claim that false negatives cannot happen.
2. Claim that ABX tests can detect or control for false negatives.
Neither of these seems plausible. Can you imagine other objections?

I believe you have the math incorrect. Here's where I got those numbers: check my math. If you have X trials, each with a 50% chance of being guessed correctly, the chance of getting at least Y of them correct by guessing is:
sum over k = Y..X of (X choose k) * (1/2)^X
That means
5 in a row is 96.9% confidence (3.1% chance to get them right by guessing)
7 of 7 is 99.2% (0.8% chance to get them right by guessing)
7 of 8 is 96.5% (3.5% chance to get them right by guessing)
Your example of 9 of 10 would be 98.9% confidence.

The protocol for such testing is a minimum of 10 tests. Read up on how DBTs work.
 

MRC01 · Major Contributor · Pacific Northwest
I believe your statement that "the minimum number of tests is 10, for which 9 correct gives a score of 95% confidence" is incorrect, and I explained why. Do you see a problem with my math?
If your goal is 95% confidence allowing 1 mistake, the minimum # of trials is 8. That's because 7 of 8 is 96.5%.
7 trials is not enough; if you miss one, 6 of 7 is only 93.8%.
Of course there's nothing magic about 95%. You can choose any confidence level you want, and apply the math to determine how many trials you need. Or just do the trials then compute your confidence from the results.

However, it's possible that some researchers use a different formula based on different assumptions. I'm assuming each trial is a 2-way A/B comparison so every choice is binary. You have a 50% chance to get each individual trial correct by guessing.
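To spell that reasoning out, here is a minimal sketch that searches for the smallest trial count under the same 50/50-guess assumption; the function names are illustrative, not taken from any testing standard.

```python
# Smallest number of binary trials reaching a target confidence while
# allowing a fixed number of misses, under the 50/50-guess null.
from math import comb

def guess_p(trials: int, correct: int) -> float:
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

def min_trials(confidence: float = 0.95, misses: int = 1) -> int:
    n = misses + 1
    while guess_p(n, n - misses) > 1 - confidence:
        n += 1
    return n

print(min_trials(misses=0))  # 5: five straight clears 95%
print(min_trials(misses=1))  # 8: 7 of 8 clears 95%, 6 of 7 does not
```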
 

MRC01 · Major Contributor · Pacific Northwest
...And the "differences we can hear but are too subtle to pass through this filter" trope is just another way of saying "I can hear differences, but they are too subtle to hear."
That's amusing, but it's a bit more subtle. Since we can't discern A and B simultaneously, we must switch back and forth, which requires memory, which is a step removed from direct perception. It's at least plausible, though certainly not provable, that we can discern more subtle differences through direct perception than we can recall through memory of perception.

So the statement I could agree with is: "I might be hearing differences that are too subtle to discern through memory, rather than immediate perception". Even if the ABX test has zero switch delay, you are still comparing what you are directly perceiving, with your recent memory of what you perceived before. That's a step removed from comparing directly. Since ABX switching delays are well-known to reduce test sensitivity, it's at least plausible that memory of perception is inherently less sensitive than direct perception.

I don't claim that these false negatives definitely exist. I think the evidence from switch delays suggests it's plausible, but equally well informed people may disagree on that! My claim is that blind testing (in ABX or any other format) can't detect false negatives, so whether they exist is an open question. This precision vs. recall distinction makes blind testing asymmetric with respect to false positives & false negatives, which creates some confusion and endless debate.
 

andreasmaaan · Master Contributor, Forum Donor
That's amusing, but it's a bit more subtle. Since we can't discern A and B simultaneously, we must switch back and forth, which requires memory, which is a step removed from direct perception. It's at least plausible, though certainly not provable, that we can discern more subtle differences through direct perception than we can recall through memory of perception.

So the statement I could agree with is: "I might be hearing differences that are too subtle to discern through memory, rather than immediate perception". Even if the ABX test has zero switch delay, you are still comparing what you are directly perceiving, with your recent memory of what you perceived before. That's a step removed from comparing directly. Since ABX switching delays are well-known to reduce test sensitivity, it's at least plausible that memory of perception is inherently less sensitive than direct perception.

I don't claim that these false negatives definitely exist. I think the evidence from switch delays suggests it's plausible, but equally well informed people may disagree on that! My claim is that blind testing (in ABX or any other format) can't detect false negatives, so whether they exist is an open question. This precision vs. recall distinction makes blind testing asymmetric with respect to false positives & false negatives, which creates some confusion and endless debate.

It’s an interesting argument. If I understand correctly, you’re saying that DBTs render people unable to discern differences that they may be able to discern only if they were able to have dual conscious experiences simultaneously.

This argument seems to concede that, given the limitation of being able to have only one conscious experience at a time, humans are not able to discern differences between X and Y.

And yet you seem to be saying that, despite being unable to discern any difference, one can have a preference that stands independently of the kinds of extraneous factors (e.g. expectation bias, placebo effect, etc.) that DBTs control for.

If this is so, how can you escape saying that it’s possible to prefer one of two things between which (given you are limited to one conscious experience at a time) you can’t discern any difference (other than for extraneous reasons)?
 

MRC01 · Major Contributor · Pacific Northwest
Your notion of dual conscious experiences is close to what I'm getting at, but it's less philosophical than that. It's more about the limited dimensionality of audio perception. Visually, we can perceive two different things simultaneously: hold them right next to each other and look at both. We can do the same with touch: feel both objects, one in each hand, simultaneously. But with audio, we can only perceive one sound at a time; if you mix them up and play one in each ear simultaneously, it ruins the test. So a visual or tactile comparison is more direct; it doesn't rely on memory. An audio comparison necessarily relies on memory. Even instant switching relies on memory: you compare what you're hearing now to your memory of what you were hearing a moment ago. Short-term memory is much more accurate than long-term memory, but it's still a step removed from a direct perceptual comparison like we can do with the visual or tactile senses.

One might say this is a distinction without a difference: the time delay is so brief, our short-term memory is reliable enough to disregard it. Yet consider: with ABX testing, even a switch delay of less than a second can reduce sensitivity. Especially with subtle differences near the threshold of perception, switching delays of, say, 500 ms can reduce sensitivity enough to fail the test. And when performing an ABX test, even with instant switching, we can only hear one at a time, so we're always comparing what we hear now to what we heard several seconds ago, or longer.

I've said nothing about preferences. I'm only talking about confidence of testing whether we can perceive differences, and asymmetry of false positives vs. false negatives.

However, at the risk of opening a whole 'nuther can of worms and diluting the point, I believe preferences are orthogonal to differences. We can hear differences without having a preference. And we can have a preference without hearing differences. Put differently: preferences can be based on actual hearing perception, but they can also be based on expectation bias, placebo effect, etc.

If a blind test shows a positive result, you know that there is a difference (within the parameters of test confidence), so if you also happen to have a preference you can say within the test confidence it's "real" based on actual hearing. But if a blind test shows a negative result, you simply don't know one way or the other. It's possible you're hearing a real difference too subtle to pass the barrier of short-term memory. It's also possible you're not hearing any difference at all. Either assumption would be a leap of faith not supported by the test.
 

March Audio · Master Contributor, Audio Company · Albany, Western Australia
At the risk of resurrecting a stale discussion: in thinking about blind or double-blind testing, one of the things I believe contributes to the debate, and sometimes to confusion, is that these tests are asymmetric with respect to precision and recall. In these tests, precision is known and controllable but recall is unknown. Put differently: you can control and measure the rate of false positives, but you can't measure, let alone control, false negatives.

You can reduce the rate of false positives by controlling for known factors like level matching, and by doing more trials. The more trials, the higher the precision. Of course the number of trials is limited by listener fatigue, but 95% confidence only requires 5 straight, or 7 of 8, and most listeners can go at least that long before fatigue sets in. So in practice you can get better than 95% confidence of having no false positives.

Problem is, you never know about false negatives. They can happen for any number of factors: switching delay, listener fatigue, stress, appropriateness of musical selection, whatever. But there's no way to measure them, let alone control them. The crux is that we are biologically incapable of properly discerning by listening to both A and B simultaneously, so we must switch back and forth, which is inherently imperfect. It's possible (I would say not just possible, but plausible) that there are differences we can hear but are too subtle to pass through this filter.

So blind testing is like a one-way door. Positive results tell you something definitive, within the confidence range that you can control and measure. But negative results tell you nothing; more precisely, only that whatever difference you were testing for was not obvious enough to be detected. It doesn't mean the difference doesn't exist, nor does it mean you can't hear it. A negative test result simply doesn't tell you one way or the other.

In short, I think A/B/X testing is an important tool that reveals valuable insights. Certainly, participating in them (and training for them) over the years has honed my perception while at the same time making me a bit more humble, and taught me to be a more careful audiophile. Yet this asymmetry of its results is an important limitation to understand.
All this suggests is that if false negatives are really a problem, any actual difference between A and B is so small that it is insignificant. If you need to be in some kind of Zen-like state to even detect it, and multiple subjects can't find it, then it just isn't important.

Using multiple subjects must reduce the chance of false negatives being an issue; if it doesn't, then the difference falls in the region below human detectability.
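A rough way to quantify that intuition, with assumed per-listener detection rates purely for illustration:

```python
# Back-of-envelope: if each of N independent listeners would detect a real
# difference with probability r in a given run, the chance that all of them
# return a (false) negative is (1 - r)**N. The rates r are assumptions.
for r in (0.2, 0.5, 0.8):
    for n in (1, 5, 10):
        print(f"detection rate {r:.0%}, {n} listeners: "
              f"all miss with probability {(1 - r)**n:.2%}")
```

Even a modest 20% per-listener detection rate leaves only about a 10% chance that ten independent listeners all miss the difference.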
 

amirm · Founder/Admin, Staff Member, CFO (Chief Fun Officer) · Seattle Area
So blind testing is like a one-way door. Positive results tell you something definitive, within the confidence range that you can control and measure. But negative results tell you nothing; more precisely, only that whatever difference you were testing for was not obvious enough to be detected. It doesn't mean the difference doesn't exist, nor does it mean you can't hear it. A negative test result simply doesn't tell you one way or the other.
I was with you till you said it tells you nothing. :)

The issues you state are true, but we have tools to deal with them. The best description is in the international standard ITU-R BS.1116, Methods for the subjective assessment of small impairments in audio systems.

The document is quite approachable, so I suggest reading it. For now, the set of tools we use to increase detection is:

1. Pre-introduction to the test. Listeners can, as a group, evaluate the content, test setup, etc., all sighted (or blind), to become familiar with the content, methods and potential audible differences. They are free to do as much of this as they like.

2. Controls. Positive controls are included to a) ensure the test is good and b) weed out poor listeners. For example, in a test of the transparency of MP3 at 320 kbps versus CD, we include a 64 kbps MP3. The latter has narrower frequency response and many more artifacts. If someone misses that, they are excluded from the test.

3. Listener training. It is appropriate and indeed recommended that trained listeners be used where possible. Training involves hearing the most extreme version of the distortion under test, and then gradually lowering the level of impairment. After practice, such listeners can reliably tell differences that escape even the most ardent audiophiles.

4. Keeping the length of the test short. Trained listeners can be sampled in advance of the test to see whether they can tolerate its length.

5. Memory aids. Blind testing tools need to have controls where small sections of music can be looped, with instantaneous switching between inputs. Short-term memory is very accurate and, using this technique, we can put it in control rather than relying on sloppy long-term memory. In my testing of very small impairments, at times I identify a note as short as half a second that I loop to find differences.

Note that such a control does not exist in the typical evaluations that audiophiles do. They rely on very long-term memory (e.g. listening to one song and then repeating it), which completely blows away any chance they have of telling small differences.

6. Statistical analysis to catch people who are randomly guessing. Their results can be excluded from the final results if needed.

Following this system, as international organizations that develop audio technology do, the level of acuity (i.e., the false-negative rate, as you call it) is far, far better than that of the mass public and mass audiophiles. See this published test on the subject: https://www.audiosciencereview.com/...ity-and-reliability-of-abx-blind-testing.186/

So while it is true that no test is perfect, and it is abundantly easy to get negative (or positive) results, we do know how to properly conduct such tests so that the results are pretty defensible. Once we combine multiple such tests, we build up confidence in the results.
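One standard way to combine such independent tests is Fisher's method, which pools the individual p-values into one overall p-value. A minimal sketch, with invented p-values purely for illustration:

```python
# Fisher's method: pool p-values from independent blind tests.
# The three input p-values are invented; each alone misses 95% confidence.
from math import log
from scipy.stats import chi2

p_values = [0.15, 0.10, 0.09]
stat = -2 * sum(log(p) for p in p_values)          # Fisher's statistic
p_combined = chi2.sf(stat, df=2 * len(p_values))   # chi-square tail, 2k dof
print(f"combined p = {p_combined:.3f}")            # ~0.04: pooled evidence is significant
```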
 

amirm · Founder/Admin, Staff Member, CFO (Chief Fun Officer) · Seattle Area
If a blind test shows a positive result, you know that there is a difference (within the parameters of test confidence), so if you also happen to have a preference you can say within the test confidence it's "real" based on actual hearing. But if a blind test shows a negative result, you simply don't know one way or the other. It's possible you're hearing a real difference too subtle to pass the barrier of short-term memory. It's also possible you're not hearing any difference at all. Either assumption would be a leap of faith not supported by the test.
I was having a chat with my doctor once after I grilled him for a while on the cause and effect of what was wrong with me. I then stopped and asked him if he gets that from other engineers. He said all the time! He said they see the world as black or white, but that medicine is about every shade of gray. You build up confidence towards one outcome and that is that. It may not be 100%, but at some point it is likely to be correct. Is this the flu or just a cold? Is it chest pain or a heart attack?

The same applies here. If we have a bunch of expert/trained listeners listen to two DACs and they can't tell the difference, we know a lot about the audible differences between them. It is possible that a difference exists, but unlikely. These listeners are liable to be better than most people, so if they can't hear a difference, what chance does the average Joe have? Shades of gray. :)
 