So blind testing is like a one-way door. Positive results tell you something definitive, within a confidence range that you can control and measure. But negative results tell you nothing; more precisely, only that whatever difference you were testing for was not obvious enough to be detected. It doesn't mean the difference doesn't exist, nor does it mean you can't hear it. A negative result simply doesn't tell you one way or the other.
I was with you till you said it tells you nothing.
The issues you state are real, but we have tools to deal with them. The best description is in the international standard, ITU-R BS.1116,
Methods for the subjective assessment of small impairments in audio systems
The document is quite approachable, so I suggest reading it. For now, the tools we use to increase detection are:
1. Pre-introduction to the test. Listeners can, as a group, evaluate the content, test setup, etc., sighted or blind, to become familiar with the content, the methods, and the potential audible differences. They are free to do this as much as they like.
2. Controls. Positive controls are included to a) make sure the test is valid and b) weed out poor listeners. For example, in a test of the transparency of 320 kbps MP3 versus CD, we include a 64 kbps MP3. The latter has narrower frequency response and far more artifacts. Anyone who misses it is excluded from the test.
3. Listener training. It is appropriate, and indeed recommended, that trained listeners be used where possible. Training involves hearing the most extreme version of the distortion under test and then gradually lowering the level of impairment. After practice, such listeners can reliably hear differences that escape even the most ardent audiophiles.
4. Keeping the test short. Trained listeners can be run through the test in advance to see whether its length is tolerable before fatigue sets in.
5. Memory aids. Blind testing tools need controls that let small sections of music be looped, with instantaneous switching between inputs. Short-term auditory memory is very accurate, and this technique puts it in control rather than sloppy long-term memory. In my testing of very small impairments, I will at times identify a note as short as half a second and loop it to find differences.
Note that no such control exists in the typical evaluations audiophiles perform. They rely on very long-term memory (e.g., listening to one song and then repeating it), which completely destroys any chance they have of telling small differences apart.
6. Statistical analysis to catch people who are randomly guessing. Their results can be excluded from the final tally if needed.
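As a sketch of what that statistical check looks like: the standard approach for an ABX test is a one-sided binomial test against chance (50% for a two-choice trial). The function below is illustrative, not from any particular test suite, and uses only the Python standard library.

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided p-value: the probability of getting at least
    `correct` answers right out of `trials` ABX trials by pure
    guessing (chance = 0.5 per trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# A listener scoring 12/16 is unlikely to be guessing (p ~ 0.038),
# while 9/16 is entirely consistent with chance (p ~ 0.40).
print(abx_p_value(12, 16))
print(abx_p_value(9, 16))
```

A listener whose p-value stays high across sessions is statistically indistinguishable from a guesser, which is the basis for excluding their data.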
Following this system, as the international organizations that develop audio technology do, the level of acuity achieved (i.e., the false-negative rate, as you call it) is far, far better than that of the mass public or mass audiophiles. See this published test on the topic:
https://www.audiosciencereview.com/...ity-and-reliability-of-abx-blind-testing.186/
So while it is true that no test is perfect, and it is abundantly easy to get negative (or false positive) results, we do know how to conduct such tests properly so that the results are quite defensible. Once we combine multiple such tests, we build up confidence in the results.
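One hedged sketch of how "combining multiple tests" can be made quantitative: Fisher's method merges independent p-values into a single combined p-value. This is a standard statistical technique, not something prescribed by BS.1116; the helper below is an illustration using only the standard library (the chi-square survival function has a closed form when the degrees of freedom are even, as they are here).

```python
from math import exp, log, factorial

def fisher_combined_p(p_values):
    """Fisher's method for combining independent one-sided p-values.
    X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom, where k = number of tests. Since 2k is
    even, the survival function reduces to a finite sum."""
    k = len(p_values)
    half = -sum(log(p) for p in p_values)  # x/2 where x = -2*sum(ln p)
    return exp(-half) * sum(half ** i / factorial(i) for i in range(k))

# Two independent tests that each came in at p = 0.10 combine
# to roughly p ~ 0.056: individually weak, jointly more telling.
print(fisher_combined_p([0.10, 0.10]))
```

The design point: several individually inconclusive tests, run at different labs or sessions, can together yield a defensible overall conclusion, which is exactly the "build up confidence" step described above.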