Limitations of blind testing procedures

oivavoi · May 1, 2017

Cosmik said:
Yes, for there to be an audible difference, at least one of the 'DACs' has to be quite a long way off from being a DAC, and instead must be some sort of effects box - the deviation would clearly show up in measurements so you wouldn't need to do a listening test for difference.

But each to their own: some people may genuinely like the effects box. But my claim is that nothing useful can be gained from a scientific listening test for preference - it is a completely open-ended aesthetic judgement kind of thing, like asking people what their favourite colour is.

Agreed.

Purité Audio · May 1, 2017

Once you precisely level-match and compare unsighted, things get really interesting.
https://www.puriteaudio.co.uk/single-post/2017/02/08/Level-matching-for-fun

Keith

Blumlein 88 · May 1, 2017

Purité Audio said:
Once you precisely level-match and compare unsighted, things get really interesting.
https://www.puriteaudio.co.uk/single-post/2017/02/08/Level-matching-for-fun

Keith

Looks fine though I would suggest a simpler methodology. I simply measure at the loudspeaker terminals. Your method would be needed if using powered speakers of course.

Also nothing wrong with using a nice voltmeter, but even an inexpensive one is fine for comparative level setting. Even should it be somewhat inaccurate the inaccuracy will be the same for both measurements.
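
The comparative level check described here comes down to a single voltage ratio. A minimal sketch, assuming two AC voltage readings taken at the loudspeaker terminals with the same test tone; the function names, the example voltages, and the 0.1 dB tolerance are illustrative assumptions, not values from this thread:

```python
import math

def level_difference_db(v_a: float, v_b: float) -> float:
    """Level of reading A relative to reading B, in dB (voltage ratio)."""
    if v_a <= 0 or v_b <= 0:
        raise ValueError("voltages must be positive")
    return 20.0 * math.log10(v_a / v_b)

def is_level_matched(v_a: float, v_b: float, tolerance_db: float = 0.1) -> bool:
    """True if the two readings differ by no more than tolerance_db (assumed tolerance)."""
    return abs(level_difference_db(v_a, v_b)) <= tolerance_db

# Hypothetical readings in volts for device A and device B.
diff = level_difference_db(2.83, 2.80)
print(f"difference: {diff:.3f} dB, matched: {is_level_matched(2.83, 2.80)}")

# A meter that reads 5% high on everything scales both readings alike,
# so the ratio (and therefore the dB difference) is unchanged.
meter_error = 1.05
same = level_difference_db(2.83 * meter_error, 2.80 * meter_error)
print(f"with meter error: {same:.3f} dB")
```

Because only the ratio matters, a constant scale error in the meter cancels out, which is why an inexpensive meter can be enough for comparative level setting.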

Is that an Audient interface in the picture there?

oivavoi · Jun 11, 2017

Resurrecting this thread, as I just stumbled across a scientific article which seems to me to be pertinent to the discussion here (and for the "Can you trust your ears" and "blind test design" threads as well). It's an article about different kinds of auditory/echoic memory:
"From Sensory to Long-Term Memory: Evidence from Auditory Memory Reactivation Studies"
https://pdfs.semanticscholar.org/2096/25309d5e01183db129fa3a7151945cdf3e8f.pdf

The present "rational consensus" seems to be that
a) the echoic memory is very short, and A/B or ABX-comparisons therefore have to use short excerpts and short time intervals
b) test tones/signals are better suited for finding differences than musical excerpts, since the brain doesn't get fooled by getting drawn into the music

While this is probably valid for many cases, this article nevertheless seems interesting. The claim in the article is that there also exists a long-term auditory memory. They base this on a very intuitive and self-evident fact: That we are able to recognize things like specific voices of people we know, even after a long time. The article is rather technical and a heavy read, but my take-away is that it is easier to remember tones and sounds for a longer time if we can put them into categories/systems/regularities. When we can put sounds in a specific context, we remember them more easily.

What's the relevance for blind testing? I think that "analytical" blind testing might take away some of our ability to put sounds within a larger context, and thus make it more difficult to identify differences. I assume that this can be overcome by training on the specific task at hand, but I still think that this can easily mask differences for untrained testees. It also seems to me that ABX tests are way too difficult for the brain to handle, given the limitations of our auditory memory, and that AB tests would be better.

I also wonder about what kind of differences we might expect to find in AB-comparisons. If our short term acoustic memory is so short, it might be able to spot what I would call "static" differences - that would be differences in frequency response, for example. But what about differences in dynamics? Or transients? Or the time domain in general? Everything which has to do with changes in the music over time seems to me to be rather difficult to capture in short-term listening. Might this go some way towards explaining why Toole and others found that frequency response trumped all other differences in their tests?

Jinjuku · Jun 11, 2017

oivavoi said:
The present "rational consensus" seems to be that
a) the echoic memory is very short, and A/B or ABX-comparisons therefore have to use short excerpts and short time intervals
b) test tones/signals are better suited for finding differences than musical excerpts, since the brain doesn't get fooled by getting drawn into the music

While this is probably valid for many cases, this article nevertheless seems interesting. The claim in the article is that there also exists a long-term auditory memory. They base this on a very intuitive and self-evident fact: That we are able to recognize things like specific voices of people we know, even after a long time. The article is rather technical and a heavy read, but my take-away is that it is easier to remember tones and sounds for a longer time if we can put them into categories/systems/regularities. When we can put sounds in a specific context, we remember them more easily.

What's the relevance for blind testing? I think that "analytical" blind testing might take away some of our ability to put sounds within a larger context, and thus make it more difficult to identify differences. I assume that this can be overcome by training on the specific task at hand, but I still think that this can easily mask differences for untrained testees. It also seems to me that ABX tests are way too difficult for the brain to handle, given the limitations of our auditory memory, and that AB tests would be better.

Here is where most subjective opponents of bias control measures get it wrong. I'm out to test people's claims that their abilities are contrary to the above.

It's like someone that says they can jump straight up 20'. I'm going to put a bar up (the paper you linked to) but it's not the bar that is being tested.

This also means it's equally difficult to identify differences using their sighted methods but somehow they come up with hearing differences (including with components that are 100% proven unable to make differences).

Jakob1863 · Jun 11, 2017

@Jinjuku,

at which point do you think the claims are about abilities "contrary to the above" ?

Jakob1863 · Jun 11, 2017

As said before, choosing the methodology is a crucial point in any test, and it depends strongly on the hypothesis/question under examination.
With respect to practical relevance, meaning "normal listening" to music at home, it intuitively can't make much sense to rely on short music samples and fast switching from one DUT to another. If you can't remember something, you would not notice that it has changed the next time you listen to music.

But if you want to find confirmation for any hypothesis, you have to do experiments, and you have to realize that these tests (using human listeners) are social/behavioral experiments at the same time. So after a couple of years of observing the difficulties a lot of listeners usually had/have with "unusual conditions", I tried another approach with two preamplifiers, in which the participants did not know they were part of a test:

Both DUTs measured well and quite similarly (not unintentionally, as that was a design goal).

That means frequency response error was below ±0.02 dBr, hum and noise around -106 dBr unweighted ref. 1 V (BW 22 kHz), THD+N ~0.0009% (BW 80 kHz) @ 47 kOhm/3 V/10 Hz-20 kHz, IMD < 0.001%, crosstalk -90 dBr.

Both units were DC-coupled with servos; we were not able to find two identically responding Alps potentiometers, so the balance error was 0.11 dB in one unit and 0.18 dB in the other.

Of course there are differences if one digs a bit deeper, but correlation does not necessarily mean causality.

-------------------------------------------------------------------------------

Two identical-looking cases were used for a preference test, with the labeling changed in random order between the participants. One of the screws was invisibly sealed, but as the insides looked extremely similar, opening the cases would probably not have helped to establish a preference for one of the units.

Every participant got the units for a couple of days and was asked to tell afterwards which one (if any) he would prefer for listening.
Due to the 'hiding of the test' it was not possible to ask the participants for a run of trials; therefore the main problem was to find a group with consistent preference.

We were only able to find five listeners who would, in my opinion, prefer the same preamplifier as I did (doing a controlled blind experiment, identifying my preferred unit correctly in 5 trials), if they were detecting an audible difference.

I did not know which label belonged to which version when delivering the units to the listeners.
Over two months we collected the results, and after revealing the random labeling order it turned out that all had chosen the same unit.
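
For scale, the chance probabilities behind those two results can be worked out directly. A rough sketch, assuming each trial and each listener's choice is an independent 50/50 guess when there is no audible difference; the function name is hypothetical, and the calculation says nothing about how the five listeners were selected:

```python
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more hits in n tries."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Identifying the preferred unit correctly in 5 of 5 blind trials by guessing.
own_trials = prob_at_least(5, 5)

# All five listeners picking the one unit predicted in advance, by guessing.
all_pick_predicted = prob_at_least(5, 5)

# All five listeners agreeing on either unit (no prediction made), by guessing.
all_agree_either = 2 * 0.5**5

print(own_trials, all_pick_predicted, all_agree_either)  # 0.03125 0.03125 0.0625
```

With only five trials or five listeners, a perfect score is the only outcome that falls below the usual 5% threshold, so a single deviating answer would have left the result inconclusive.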

--------------------------------------------------------------------------------

The experiment mimicked what I'd consider to be the normal routine when comparing two devices in sighted listening.
-) the control in such a situation can't be as tight as in a laboratory situation, but 3 listeners were dealers, 1 listener was a loudspeaker developer and 1 a consumer. None of them could have done measurements with resolution good enough to reveal the differences below the specs mentioned above
-) it would have been better to include a questionnaire to get more information about the description of the sonic differences. I opted against it because we would have had to define a set of descriptive words beforehand to ensure that the answers were comparable
-) no switching devices were used

oivavoi · Jun 11, 2017

Jakob1863 said:
As said before, choosing the methodology is a crucial point in any test, and it depends strongly on the hypothesis/question under examination.
With respect to practical relevance, meaning "normal listening" to music at home, it intuitively can't make much sense to rely on short music samples and fast switching from one DUT to another. If you can't remember something, you would not notice that it has changed the next time you listen to music.

But if you want to find confirmation for any hypothesis, you have to do experiments, and you have to realize that these tests (using human listeners) are social/behavioral experiments at the same time. So after a couple of years of observing the difficulties a lot of listeners usually had/have with "unusual conditions", I tried another approach with two preamplifiers, in which the participants did not know they were part of a test:

Both DUTs measured well and quite similarly (not unintentionally, as that was a design goal).

That means frequency response error was below ±0.02 dBr, hum and noise around -106 dBr unweighted ref. 1 V (BW 22 kHz), THD+N ~0.0009% (BW 80 kHz) @ 47 kOhm/3 V/10 Hz-20 kHz, IMD < 0.001%, crosstalk -90 dBr.

Both units were DC-coupled with servos; we were not able to find two identically responding Alps potentiometers, so the balance error was 0.11 dB in one unit and 0.18 dB in the other.

Of course there are differences if one digs a bit deeper, but correlation does not necessarily mean causality.

-------------------------------------------------------------------------------

Two identical-looking cases were used for a preference test, with the labeling changed in random order between the participants. One of the screws was invisibly sealed, but as the insides looked extremely similar, opening the cases would probably not have helped to establish a preference for one of the units.

Every participant got the units for a couple of days and was asked to tell afterwards which one (if any) he would prefer for listening.
Due to the 'hiding of the test' it was not possible to ask the participants for a run of trials; therefore the main problem was to find a group with consistent preference.

We were only able to find five listeners who would, in my opinion, prefer the same preamplifier as I did (doing a controlled blind experiment, identifying my preferred unit correctly in 5 trials), if they were detecting an audible difference.

I did not know which label belonged to which version when delivering the units to the listeners.
Over two months we collected the results, and after revealing the random labeling order it turned out that all had chosen the same unit.

--------------------------------------------------------------------------------

The experiment mimicked what I'd consider to be the normal routine when comparing two devices in sighted listening.
-) the control in such a situation can't be as tight as in a laboratory situation, but 3 listeners were dealers, 1 listener was a loudspeaker developer and 1 a consumer. None of them could have done measurements with resolution good enough to reveal the differences below the specs mentioned above
-) it would have been better to include a questionnaire to get more information about the description of the sonic differences. I opted against it because we would have had to define a set of descriptive words beforehand to ensure that the answers were comparable
-) no switching devices were used

Very interesting, Jakob1863! Given that these preamps measured almost identically: What do you think is the most reasonable explanation for why one preamp was preferred over the other?

Fitzcaraldo215 · Jun 11, 2017

oivavoi said:
Resurrecting this thread, as I just stumbled across a scientific article which seems to me to be pertinent to the discussion here (and for the "Can you trust your ears" and "blind test design" threads as well). It's an article about different kinds of auditory/echoic memory:
"From Sensory to Long-Term Memory: Evidence from Auditory Memory Reactivation Studies"
https://pdfs.semanticscholar.org/2096/25309d5e01183db129fa3a7151945cdf3e8f.pdf

The present "rational consensus" seems to be that
a) the echoic memory is very short, and A/B or ABX-comparisons therefore have to use short excerpts and short time intervals
b) test tones/signals are better suited for finding differences than musical excerpts, since the brain doesn't get fooled by getting drawn into the music

While this is probably valid for many cases, this article nevertheless seems interesting. The claim in the article is that there also exists a long-term auditory memory. They base this on a very intuitive and self-evident fact: That we are able to recognize things like specific voices of people we know, even after a long time. The article is rather technical and a heavy read, but my take-away is that it is easier to remember tones and sounds for a longer time if we can put them into categories/systems/regularities. When we can put sounds in a specific context, we remember them more easily.

What's the relevance for blind testing? I think that "analytical" blind testing might take away some of our ability to put sounds within a larger context, and thus make it more difficult to identify differences. I assume that this can be overcome by training on the specific task at hand, but I still think that this can easily mask differences for untrained testees. It also seems to me that ABX tests are way too difficult for the brain to handle, given the limitations of our auditory memory, and that AB tests would be better.

I also wonder about what kind of differences we might expect to find in AB-comparisons. If our short term acoustic memory is so short, it might be able to spot what I would call "static" differences - that would be differences in frequency response, for example. But what about differences in dynamics? Or transients? Or the time domain in general? Everything which has to do with changes in the music over time seems to me to be rather difficult to capture in short-term listening. Might this go some way towards explaining why Toole and others found that frequency response trumped all other differences in their tests?

I like the paper's conclusions very much, maybe because I have developed those same inklings a long time ago from other readings and from my own experience.

I also agree with your comment about ABX testing, which I think is problematical. But I am not sure there is anything better, depending on the hypothesis you wish to test. However, I do not see ABX as a panacea. A major part of the problem with ABX, and actually with any comparison in audio, is the time-sensitive nature of music and sound, and the rapid decay of detailed perceptions in our aural memory. That is one reason I, personally, believe there is often a distinct bias in ABX toward the "no detectable difference better than chance" result.
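
One way to put a number on that concern is the miss rate of a short ABX run: how often a listener who really does hear something still fails to reach significance. A minimal sketch, assuming independent trials, a 5% significance threshold, and a hypothetical listener who answers correctly 70% of the time; the trial count, hit rate, and function names are all illustrative assumptions:

```python
from math import comb

def tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_hits(n: int, alpha: float = 0.05):
    """Smallest number of correct answers whose chance probability is <= alpha.

    Returns None when even a perfect score cannot reach alpha (too few trials).
    """
    for k in range(n + 1):
        if tail(k, n, 0.5) <= alpha:
            return k
    return None

def miss_probability(n: int, true_hit_rate: float, alpha: float = 0.05) -> float:
    """Chance that a listener with the given true hit rate fails the test."""
    k = critical_hits(n, alpha)
    if k is None:
        return 1.0  # the test cannot be passed at this alpha
    return 1.0 - tail(k, n, true_hit_rate)

n_trials = 16  # assumed run length
print("hits needed:", critical_hits(n_trials))
print("miss rate at 70% true hit rate:", round(miss_probability(n_trials, 0.7), 3))
```

Under these assumptions the listener needs 12 of 16 correct and misses that bar a little more than half the time, so a "no difference better than chance" outcome from a short run is weak evidence that nothing was heard. Longer runs or pooled sessions reduce the miss rate.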

Or, even without ABX, in other AB audio comparisons, whether for controlled testing or not, the short acoustic memory problem may bias results toward a "they all sound the same" outcome. Of course, cognitive bias, especially sighted, wants to pull things the other way, unless that can be removed in controlled tests.

I always find music very troublesome for comparisons because of its serial nature, time duration and our emotional involvement with it.

No matter how you cut it, whether for formal audio testing or informal equipment comparisons, listening comparisons are just very problematical. But, I do not think comparative listening is useless, either, imperfections and all.

Purité Audio · Jun 11, 2017

Unsighted and level-matched, there is no other way; if you know what you are listening to, the game is up.
Keith

oivavoi · Jun 11, 2017

Purité Audio said:
Unsighted and level-matched, there is no other way; if you know what you are listening to, the game is up.
Keith

I think we all agree (at this forum at least) that sighted listening can easily be influenced by bias. But I don't agree that sighted listening can't be trusted at all. There is a difference between stating that our perception can be biased, and that our perception is completely unreliable. The first claim is supported by science and common sense, the other is not.

Fitzcaraldo215 said:
I like the paper's conclusions very much, maybe because I have developed those same inklings a long time ago from other readings and from my own experience.

I also agree with your comment about ABX testing, which I think is problematical. But I am not sure there is anything better, depending on the hypothesis you wish to test. However, I do not see ABX as a panacea. A major part of the problem with ABX, and actually with any comparison in audio, is the time-sensitive nature of music and sound, and the rapid decay of detailed perceptions in our aural memory. That is one reason I, personally, believe there is often a distinct bias in ABX toward the "no detectable difference better than chance" result.

Or, even without ABX, in other AB audio comparisons, whether for controlled testing or not, the short acoustic memory problem may bias results toward a "they all sound the same" outcome. Of course, cognitive bias, especially sighted, wants to pull things the other way, unless that can be removed in controlled tests.

I always find music very troublesome for comparisons because of its serial nature, time duration and our emotional involvement with it.

No matter how you cut it, whether for formal audio testing or informal equipment comparisons, listening comparisons are just very problematical. But, I do not think comparative listening is useless, either, imperfections and all.

Agree very much. There is no easy way out, unfortunately. Sighted listening will often be biased, but blind tests may not reveal the whole picture either. As for me, the best approach for achieving good sound seems to be a non-dogmatic approach which combines different sources of data and input: Measurements, common sense and rationality, rigorous and scientific blind testing, but also supplemented by subjective and sighted evaluations of audio systems.

Purité Audio · Jun 11, 2017

Why wouldn't an unsighted test reveal the 'whole picture'? Just look at Toole's research with loudspeakers.
Keith

oivavoi · Jun 11, 2017

Purité Audio said:
Why wouldn't an unsighted test reveal the 'whole picture'? Just look at Toole's research with loudspeakers.
Keith

I wrote something about that in my post above, but I can repeat it:
"I also wonder about what kind of differences we might expect to find in AB-comparisons. If our short term acoustic memory is so short, it might be able to spot what I would call "static" differences - that would be differences in frequency response, for example. But what about differences in dynamics? Or transients? Or the time domain in general? Everything which has to do with changes in the music over time seems to me to be rather difficult to capture in short-term listening. Might this go some way towards explaining why Toole and others found that frequency response trumped all other differences in their tests?"

Phase coherency, for example, is something which Toole in his experiments deemed not to be important, but which is a very important feature of the Kii Three's. One of the guiding principles in the development of the Kii's thus seems to be that Toole's experiments didn't reveal the whole picture. Btw, the problem is not that tests are "unsighted", but rather the other things which I wrote about above.

RayDunzl · Jun 11, 2017

oivavoi said:
Phase coherency, for example, is something which Toole in his experiments deemed not to be important, but which is a very important feature of the Kii Three's.

Well...

To perform its directional magic it has to have the phase/delay relationship of the front/side/rear under control... Beyond that, not sure.

So, not having seen much measurement of the devices yet, found this right away when I looked just now.

https://www.gearslutz.com/board/12588247-post674.html

I'll see if I can download the .mdat and take a peek.

oivavoi · Jun 11, 2017

RayDunzl said:
Well...

To perform its directional magic it has to have the phase/delay relationship of the front/side/rear under control... Beyond that, not sure.

So, not having seen much measurement of the devices yet, found this right away when I looked just now.

https://www.gearslutz.com/board/12588247-post674.html

I'll see if I can download the .mdat and take a peek.

Interesting. I have no idea how the Kii's actually work. Most of the reviews I've read make a big deal of the phase thing, hence my comment.

Jakob1863 · Jun 11, 2017

Purité Audio said:
Unsighted and level-matched, there is no other way; if you know what you are listening to, the game is up.
Keith

It depends on the listener, because it is a matter of bias control. Humans are able to learn to handle the impact of bias up to a certain degree. If they weren't, "unsighted" listening wouldn't work either, as there are still numerous bias mechanisms at work. Unfortunately people often forget about that.

Purité Audio · Jun 11, 2017

I have experienced on many occasions a 'perceived' sighted difference which disappears when compared unsighted.
The Kii's are completely phase coherent in 'normal' latency mode, apparently the only speakers which are. When you engage 'low' latency you lose that, but to be honest I don't really notice much of a difference. I have heard that some listeners are sensitive to phase.
Keith

tomelex · Jun 11, 2017

Perhaps, if you guys went out and bought a parametric equalizer or used one on your computer, and tried changing how your favorite songs sounded, you might come to terms with how easy it is to change audio, how many permutations can all sound equally good, and how you can change the way your favorite song is equalized every day and still have a good experience. I say this not to discount blind tests, but to acknowledge that WE are the variable here, and the megabuck unit that sounded great to you today could very well not sound as great a week from now. I also say this to point out that minute variations on your favorite song, made with the parametric equalizer, will not be audible to you, yet the song is not the same, which is kind of like splitting hairs over which good-quality gear sounds "better". To see results, you need to be listening to different amp topologies, different speakers, etc., and all the time with your head locked in a vice, as a movement of 6 inches can change the level of sound by 6 dB depending on your room modes, etc.

Don't forget, if you're of advancing years, you just threw out everything above, say, 6 kHz or 8 kHz, or whatever you pick. The equalizer will tell you a lot about your hearing or lack thereof. I agree that it is important to purchase something whose looks you like and whose value you feel reaches your threshold, and if you look at the specs and they are about the same as the mega-priced units, you are in pretty good shape anyway.

Blumlein 88 · Jun 11, 2017

Jakob1863 said:
As said before, choosing the methodology is a crucial point in any test, and it depends strongly on the hypothesis/question under examination.
With respect to practical relevance, meaning "normal listening" to music at home, it intuitively can't make much sense to rely on short music samples and fast switching from one DUT to another. If you can't remember something, you would not notice that it has changed the next time you listen to music.

But if you want to find confirmation for any hypothesis, you have to do experiments, and you have to realize that these tests (using human listeners) are social/behavioral experiments at the same time. So after a couple of years of observing the difficulties a lot of listeners usually had/have with "unusual conditions", I tried another approach with two preamplifiers, in which the participants did not know they were part of a test:

Both DUTs measured well and quite similarly (not unintentionally, as that was a design goal).

That means frequency response error was below ±0.02 dBr, hum and noise around -106 dBr unweighted ref. 1 V (BW 22 kHz), THD+N ~0.0009% (BW 80 kHz) @ 47 kOhm/3 V/10 Hz-20 kHz, IMD < 0.001%, crosstalk -90 dBr.

Both units were DC-coupled with servos; we were not able to find two identically responding Alps potentiometers, so the balance error was 0.11 dB in one unit and 0.18 dB in the other.

Of course there are differences if one digs a bit deeper, but correlation does not necessarily mean causality.

-------------------------------------------------------------------------------

Two identical-looking cases were used for a preference test, with the labeling changed in random order between the participants. One of the screws was invisibly sealed, but as the insides looked extremely similar, opening the cases would probably not have helped to establish a preference for one of the units.

Every participant got the units for a couple of days and was asked to tell afterwards which one (if any) he would prefer for listening.
Due to the 'hiding of the test' it was not possible to ask the participants for a run of trials; therefore the main problem was to find a group with consistent preference.

We were only able to find five listeners who would, in my opinion, prefer the same preamplifier as I did (doing a controlled blind experiment, identifying my preferred unit correctly in 5 trials), if they were detecting an audible difference.

I did not know which label belonged to which version when delivering the units to the listeners.
Over two months we collected the results, and after revealing the random labeling order it turned out that all had chosen the same unit.

--------------------------------------------------------------------------------

The experiment mimicked what I'd consider to be the normal routine when comparing two devices in sighted listening.
-) the control in such a situation can't be as tight as in a laboratory situation, but 3 listeners were dealers, 1 listener was a loudspeaker developer and 1 a consumer. None of them could have done measurements with resolution good enough to reveal the differences below the specs mentioned above
-) it would have been better to include a questionnaire to get more information about the description of the sonic differences. I opted against it because we would have had to define a set of descriptive words beforehand to ensure that the answers were comparable
-) no switching devices were used

So can you fill in more detailed info about this. How many total listeners were involved?

Something I am unclear about, did each participant get the units once, and make a choice? Or did they get them multiple times? Or did only people picking the unit you picked get additional choices?

Fitzcaraldo215 · Jun 11, 2017

Purité Audio said:
I have experienced on many occasions a 'perceived' sighted difference which disappears when compared unsighted.
The Kii's are completely phase coherent in 'normal' latency mode, apparently the only speakers which are. When you engage 'low' latency you lose that, but to be honest I don't really notice much of a difference. I have heard that some listeners are sensitive to phase.
Keith

Were you sighted or blind when you made that comparison? Just kidding, Keith, but wait a second.

We both know that, as a practical matter, comprehensive blind tests are usually just impossible. If you can state that your dealership is equipped for rapid switching between level-matched speakers for comparison in double blind, I would be thoroughly impressed. Actually, I would be hugely impressed even if you could only compare electronics that way in your showroom via a single speaker pair.

I think @oivavoi and I are agreeing that there sure are major flaws with sighted listening comparisons, but that even seemingly rigorous DBTs are not necessarily perfect either. The rapid decay of acoustic memory is one of the problems, but there are others. There are risks in total reliance on DBTs and the assumption that they automatically provide a 100% accurate

A case study illustrating that is the widely cited Meyer-Moran study of a decade or so ago. It, via seemingly thorough "pristine" test and statistical analysis methods, "proved" there was no audible difference between CDs and SACDs. Since having been published, however, Amir and many other audio experts have debunked its methodology and its conclusions as seriously flawed. You could look up Amir's comments on it in his reference library here.

I know in these "debates" even a little bit of concession to pragmatism may be considered a mortal sin. But, we just don't have the published blind comparisons on the vast majority of equipment or the wide availability of facilities, like dealerships, where we could perform "perfect" blind tests on our own. So, what are we to do?

I don't disagree that comprehensive well done tests, like Harman's, are great. But, since your Kii's or Dutch's or whatever were not tested in that protocol, what are we to make of them and how are we to assess them?
