Limitations of blind testing procedures

Thomas savage · Jul 28, 2017

Jakob1863 · Jul 30, 2017

Thomas savage said:
I think Iv afforded you fair rein , as I said I would. People will read the arguments ( they might actually give up as these exchanges are hard on the eye ) and make their own minds up.

The content of a persons debate will be scrutinised as well as the spirit it's been delivered under, my job is to leave as open a platform as I can and I'm happy that's been achieved. Any ill you feel that's been delivered in amir s posts will stand the same scrutiny, the scrutiny of the readership.

I´m sorry, but imo it is well know that this sort of sometimes subtile, sometimes quite obvious "bullying" (might be to harsh, but i don´t know another word) can´t be handled just by hoping on informed readership. Partly for the reasons that most readers will not read back to look if something was really written/said or not.
You´ll have a hard time to find one post of amirm directed to me, that does _not_ contain an incorrect statement about something i wrote, or imagination of things i´ve never wrote or just pure imagination of things he simply can´t know, because he wasn´t there as they happened.

Of course we all can and do err, but if it constantly happens and consistently in the same direction - try to find a post where he erred in favour of one my thoughts/descriptions - it likely happens not just by accident.

Trying to fix it takes unfortunately time and posting space and, as you already noted, makes reading uncomfortable, the argumentation gets watered and basically it is nothing what helps to proceed the discussion.
Maybe it´ll help to divide the posts into a factual argumentation part and another to list/answer the erroneous/imagined attributions.

You have had a open floor, those reading will make their minds up, my request stands and in no way was it a hostile one. It's polite to introduce yourself and given the nature of these types of discourse ( technically orientated etc) advantageous for folks to know the back grounds of those they are debating .

Given your persistence you will no doubt court a certain type of response from some of the members but I'm fair minded and have no issue with you so you needent concern yourself in this regard.

Just introduce yourself, and keep these comments in mind as should all members .

I´ll give it try after consideration what i can reveal without.....
I know that often people think that way, but my stance on it was ever that arguments/facts/data stand and speak alone, everything else is belief to authority or a comparable fallacy.

The Smokester · Jul 30, 2017

I've just got one thing for you guys:

Jakob1863 · Jul 30, 2017

BE718 said:
No. I have no problem with an open mind. What I object to is you using the possibility that anything can happen as a technique to imply something is happening or has any likelihood of happening.

What should one do, if there is no data available that describes the impact of every bias mechanism on participants?
If i don´t have information about, then it is a good advice to follow Laplace´s principle of insufficient reason, which means (in short) to assign the same prior probability to all effects (1/no. of effects).

As said before, one of the golden rules follows the same approach, "block out (means of the bias effects) what you can and randomize what you can´t block out". The approach "pah, i imagine it won´t have any influence or won´t be that bad" doesn´t bode well.

Thus far you have provided no compelling evidence of...... well...anything at all. Just nebulous statements and semantics.

Is it a game where the first one shouting "semantics" wins?
Please remember the starting point of our discussion, you´ve asserted that _all_ "audiophiles" went blind, when tested under "blind" conditions. I asked three times for additional details to evaluate what happened in your/those events, but you didn´t supply some!?

Do you have any specific points to make with supporting evidence that can be scrutinised?

Sure, as stated before, getting incorrect results under "blind" conditions is as easy as it is in sighted listening tests. As "blind" overstates the importance of the "blind virtue" it is better to talk about controlled listening tests or to mention the specific protocol used.

It is well known that a plethora of bias effects still works - after some are removed due to the blinding - see for example the articles by Zielinski/Rumsey/Bech /1/ and Zielinski /2/ on various bias mechanism.

To evaluate the impact of bias effects we have to know what the correct answer should have been and compare that to the actual observed data. As said before, it is well known that in "same/different" tests participants have a very high "fail rate" if two identical stimuli were presented, a good example for that are the results from the Stereophile audio amplifier listening test /3/ .

That is something that happens not only in controlled listening tests but in other sensory tests as well, Alvary-Rodriguez et al. /4/ reported up to 83% fails (when presented the same stimuli twice in a row) with potato chips and mentioned in their article that Ennis reported nearly 80% fails when testing cigarettes.

There was even a cross-cultural study to see if these specific bias effects work to the same degree in different populations and found that they worked to different degrees. (Marchisano, C. et al. /5/)

Whilst having a closed mind is obviously not good, equally having a mind that accepts anything, so open your brain falls out on to the floor, is equally non productive.

I hope it is easier now to understand why the "pah, won´t matter much" approach will most likely fail....

/1/ Zielinski, Slawomir, Rumsey, Francis, Bech, Soren; On Some Biases Encountered in Modern Audio Quality Listening Tests—A Review, JAES Volume 56, Issue 6 pp. 427-451, June 2008

/2/ Zielinski, Slawomir; On Some Biases Encountered in Modern Audio Quality Listening Tests (Part 2): Selected Graphical Examples and Discussion, JAES Volume 64 Issue 1/2 pp. 55-74; January 2016

/3/ https://www.stereophile.com/content/blind-listening-page-4

/4/ Alfaro-Rodriguez, Hayde & Angulo, Ofelia & O'Mahony, Michael. (2007). Be your own placebo: A double paired preference test approach for establishing expected frequencies. Food Quality and Preference - FOOD QUAL PREFERENCE. 18. 353-361. 10.1016/j.foodqual.2006.02.009.

/5/ Marchisano, C., Lim, J., Chao, H. S., Suh, D. S., Jeon, S. Y., Kim, K. O., & O’Mahony, M. (2003). Consumers report preferences when they should not: A cross-cultural study. Journal of Sensory Studies, 18, 487–516.

Jakob1863 · Jul 30, 2017

amirm said:
Let me go back to this experiment of yours. You say there were two identical boxes? If so, these were not two commercial units or else they would look completely different, right?

Were both of these boxes of your build/design?

Yes and Yes.

You say this is closer to what audiophiles typically do. In what occasion do people get two identical looking boxes to evaluate for better fidelity? Don't you think they don't know at all times they are under a test microscope in such a situation?

"Closer" means "not exactly like every other comparison, but not as artifical as usual tests (often) are"; as we did something alike with these people already occasionally for evaluation purposes (i did/do the same with prototypes from them), therefore, no i don´t think that they were feeling like being under a microscope.

Is it correct that each participant only produced one vote/trial?

Yes.

amirm · Jul 30, 2017

Jakob1863 said:
"Closer" means "not exactly like every other comparison, but not as artifical as usual tests (often) are"; as we did something alike with these people already occasionally for evaluation purposes (i did/do the same with prototypes from them), therefore, no i don´t think that they were feeling like being under a microscope.

How often they are given two identical boxes and asked which one sounds better? Don't you think the very notion of giving them two identical boxes would stress them more rather than less if the boxes looked different?

Where they not told this is a test by you with identity of what differed between the boxes hidden from them?

On the single trial, do you accept that the p value = .5?

amirm · Jul 30, 2017

Jakob1863 said:
Yes and Yes.

That's good because it means you can explain to us what differed in the design of each. And post more information about their measurements. Is this data forthcoming?

amirm · Jul 30, 2017

Jakob1863 said:
It is well known that a plethora of bias effects still works - after some are removed due to the blinding - see for example the articles by Zielinski/Rumsey/Bech /1/ and Zielinski /2/ on various bias mechanism.

I want to make sure we are very clear on your position (darn near impossible from your posts).

Case 1) Audiophile listener tests box A and B and overwhelmingly says box A sounds better in sighted evaluation.

Case 2) Audiophile performs exact same test as in case (1), but this time, identity of which is being played is hidden. The test is repeated a few times, outcome is highly inconsistent resulting in p >> 0.05.

You would draw what conclusion from this data? Is box A better sounding than B or not?

amirm · Jul 30, 2017

Jakob1863 said:
It is well known that a plethora of bias effects still works - after some are removed due to the blinding - see for example the articles by Zielinski/Rumsey/Bech /1/ and Zielinski /2/ on various bias mechanism.

Trusting you have read those references, do you agree with this from Zieliński, Rumsey and Beck paper:

And this in the introduction:

March Audio · Jul 31, 2017

Jakob1863 said:
What should one do, if there is no data available that describes the impact of every bias mechanism on participants?
If i don´t have information about, then it is a good advice to follow Laplace´s principle of insufficient reason, which means (in short) to assign the same prior probability to all effects (1/no. of effects).

As said before, one of the golden rules follows the same approach, "block out (means of the bias effects) what you can and randomize what you can´t block out". The approach "pah, i imagine it won´t have any influence or won´t be that bad" doesn´t bode well.

Is it a game where the first one shouting "semantics" wins?
Please remember the starting point of our discussion, you´ve asserted that _all_ "audiophiles" went blind, when tested under "blind" conditions. I asked three times for additional details to evaluate what happened in your/those events, but you didn´t supply some!?

Sure, as stated before, getting incorrect results under "blind" conditions is as easy as it is in sighted listening tests. As "blind" overstates the importance of the "blind virtue" it is better to talk about controlled listening tests or to mention the specific protocol used.

It is well known that a plethora of bias effects still works - after some are removed due to the blinding - see for example the articles by Zielinski/Rumsey/Bech /1/ and Zielinski /2/ on various bias mechanism.

To evaluate the impact of bias effects we have to know what the correct answer should have been and compare that to the actual observed data. As said before, it is well known that in "same/different" tests participants have a very high "fail rate" if two identical stimuli were presented, a good example for that are the results from the Stereophile audio amplifier listening test /3/ .

That is something that happens not only in controlled listening tests but in other sensory tests as well, Alvary-Rodriguez et al. /4/ reported up to 83% fails (when presented the same stimuli twice in a row) with potato chips and mentioned in their article that Ennis reported nearly 80% fails when testing cigarettes.

There was even a cross-cultural study to see if these specific bias effects work to the same degree in different populations and found that they worked to different degrees. (Marchisano, C. et al. /5/)

I hope it is easier now to understand why the "pah, won´t matter much" approach will most likely fail....

/1/ Zielinski, Slawomir, Rumsey, Francis, Bech, Soren; On Some Biases Encountered in Modern Audio Quality Listening Tests—A Review, JAES Volume 56, Issue 6 pp. 427-451, June 2008

/2/ Zielinski, Slawomir; On Some Biases Encountered in Modern Audio Quality Listening Tests (Part 2): Selected Graphical Examples and Discussion, JAES Volume 64 Issue 1/2 pp. 55-74; January 2016

/3/ https://www.stereophile.com/content/blind-listening-page-4

/4/ Alfaro-Rodriguez, Hayde & Angulo, Ofelia & O'Mahony, Michael. (2007). Be your own placebo: A double paired preference test approach for establishing expected frequencies. Food Quality and Preference - FOOD QUAL PREFERENCE. 18. 353-361. 10.1016/j.foodqual.2006.02.009.

/5/ Marchisano, C., Lim, J., Chao, H. S., Suh, D. S., Jeon, S. Y., Kim, K. O., & O’Mahony, M. (2003). Consumers report preferences when they should not: A cross-cultural study. Journal of Sensory Studies, 18, 487–516.

Honestly Jakob I just don't have the energy to read through your pages of nebulous waffling to try and figure out,what point you are actually making...If any at all.

The only thing I have been able to draw so far is that you are trying to discredit blind testing by using the rationale that "anything is possible" without any evidence that something specific is happening.

March Audio · Jul 31, 2017

amirm said:
I want to make sure we are very clear on your position (darn near impossible from your posts).

Case 1) Audiophile listener tests box A and B and overwhelmingly says box A sounds better in sighted evaluation.

Case 2) Audiophile performs exact same test as in case (1), but this time, identity of which is being played is hidden. The test is repeated a few times, outcome is highly inconsistent resulting in p >> 0.05.

You would draw what conclusion from this data? Is box A better sounding than B or not?

Jakob, please answer Amirs question concisely.

Thomas savage · Jul 31, 2017

Jakob1863 said:
I´m sorry, but imo it is well know that this sort of sometimes subtile, sometimes quite obvious "bullying" (might be to harsh, but i don´t know another word) can´t be handled just by hoping on informed readership. Partly for the reasons that most readers will not read back to look if something was really written/said or not.
You´ll have a hard time to find one post of amirm directed to me, that does _not_ contain an incorrect statement about something i wrote, or imagination of things i´ve never wrote or just pure imagination of things he simply can´t know, because he wasn´t there as they happened.

Of course we all can and do err, but if it constantly happens and consistently in the same direction - try to find a post where he erred in favour of one my thoughts/descriptions - it likely happens not just by accident.

Trying to fix it takes unfortunately time and posting space and, as you already noted, makes reading uncomfortable, the argumentation gets watered and basically it is nothing what helps to proceed the discussion.
Maybe it´ll help to divide the posts into a factual argumentation part and another to list/answer the erroneous/imagined attributions.

I´ll give it try after consideration what i can reveal without.....
I know that often people think that way, but my stance on it was ever that arguments/facts/data stand and speak alone, everything else is belief to authority or a comparable fallacy.

Be concise, address the relevant detail with the relevant facts,thats if you have any ambition to be read, understood and your opinion to old purpose in the mind of anyone else but yourself.

I'm trying to help you communicate your point, i agree with a deal that you say Jacob but we aren't going to change the world on a web forum so you endlessly pointing at things won't change them. You have to adapt to communicate, this mostly involves ignoring plenty of crap and just addressing the most pertinent parts of the discussion.

Right now you have achieved the opposite of what you probably want ( and what I want for you, to be heard.. I can't help it if folks don't accept what you say though, that's not my job) , to be heard.. thats the basic point of communication and what I'm trying to tell you. There's limits to this medium of communication mate, it's great to have these places but we need to be as efficient as possible and recognise the space we are working within. This is not your fault of course, but fault and responsibility don't go hand in hand, not in the real world and unfortunately not here either.

amirm · Jul 31, 2017

Jakob1863 said:
/5/ Marchisano, C., Lim, J., Chao, H. S., Suh, D. S., Jeon, S. Y., Kim, K. O., & O’Mahony, M. (2003). Consumers report preferences when they should not: A cross-cultural study. Journal of Sensory Studies, 18, 487–516.

You miss the part that destroys your arguments. Take this part of the abstract prior to getting into cultural differences:

This is what we routinely face in high-end audio. Equipment that sounds/measures the same is perceived to be different with preference shown for one gear vs the other. You know, like your amplifier test. This is why I said you can't jump to preference when you know a priori that the outcome should be no difference. By using a control in there, you can be sure if there is a difference is perceived first, then proceed to quantify preference.

Preference tests indeed can be more difficult and challenging to determine. Much of your references dig into this aspect, not whether there is a difference at all.

As to cross cultural differences in audio, much of that appears to be a myth. To wit, Harman which has a multi-billion dollar car audio business was challenged to show that they understood the sound that Asians prefered vs Americans for a large car contract. Japanese companies were bidding against them saying they understood this better. So harman used binaural recording to capture the sound of a car and then conducted blind experiments in both continents. Results showed that there was NO preference difference. What we preferred, they also preferred in Asia! And went on to win this large contract.

Here is a shot of how the experiment was conducted:

Here are the results:

As we see, the order of preference is the same for Chinese testers and Harman American expert listeners (who are more critical but otherwise have same preferences).

A different test was used to determine if Japanese have difference preference than US. Here are the results of that:

Again, the Japanese results track that of Harman expert listeners (light and dark blue respectively).

So as you see, "we," the scientific community do know how to perform very high quality blind listening tests. And the results mean something, in this case, future of multi-billion dollar companies.

Jakob1863 · Aug 1, 2017

amirm said:
How often they are given two identical boxes and asked which one sounds better? Don't you think the very notion of giving them two identical boxes would stress them more rather than less if the boxes looked different?

No, although of course principially everything is possible

As said before, they did something similar already in the past and i evaluated prototypes for them.

Where they not told this is a test by you with identity of what differed between the boxes hidden from them?

No, they were asked to listen and to say which one they´d prefer.

On the single trial, do you accept that the p value = .5?

Yes, as said before, the p value in each trial was p = 0.5 .

Jakob1863 · Aug 1, 2017

amirm said:
That's good because it means you can explain to us what differed in the design of each. And post more information about their measurements. Is this data forthcoming?

Sure, if there is some merit to discuss further details. Please remember, that reason for debating about the audibility is the argument of measurements below the hearing thresholds. I posted the results of a pretty usual set of measurements given for preamplifiers. If that´s not enough information please specify precisely what is missing in comparison to the data usually published by manufacturers and why it would help wrt the thresholds of hearing.

Jakob1863 · Aug 1, 2017

amirm said:
I want to make sure we are very clear on your position (darn near impossible from your posts).

Case 1) Audiophile listener tests box A and B and overwhelmingly says box A sounds better in sighted evaluation.

Case 2) Audiophile performs exact same test as in case (1), but this time, identity of which is being played is hidden. The test is repeated a few times, outcome is highly inconsistent resulting in p >> 0.05.

You would draw what conclusion from this data? Is box A better sounding than B or not?

Again, we had a similar thing already in this thread and for the same reasons i now can only give the same answer: There is no conclusion possible regarding the given informations.

Jakob1863 · Aug 1, 2017

amirm said:
Trusting you have read those references, do you agree with this from Zieliński, Rumsey and Beck paper:

View attachment 8017

And this in the introduction:
View attachment 8018

Sure....

amirm · Aug 1, 2017

Jakob1863 said:
Yes, as said before, the p value in each trial was p = 0.5 .

There you go. In other words, if you had loaned me the amp, I could have just randomly voted for one of them being better without even listening to them. Any assertion then that it provided extra strength for one amp or the other being better would be false then.

This is why a proper test would have called for multiple trial as to make that test statistically strong on itself.

amirm · Aug 1, 2017

Jakob1863 said:
Again, we had a similar thing already in this thread and for the same reasons i now can only give the same answer: There is no conclusion possible regarding the given informations.

There was one and only one variable in that scenario: whether the listener knew the identity of sound or not. The fact that this variable completely changed the outcome does not allow you say anything about the reliability of case #1?

amirm · Aug 1, 2017

Jakob1863 said:
Sure....

So when you link to those papers, it is best to point out that as much as they may be talking about potential biases existing in blind tests, they outright and with strong conviction damn any sighted tests. That is what is at stake here. Your tendency for using the reference for one small purpose while ignoring the larger one seems totally illogical or else, is evidence of not having read those references. Not sure which one is worse.

Limitations of blind testing procedures

Grand Contributor

Addicted to Fun and Learning

Active Member

Addicted to Fun and Learning

Addicted to Fun and Learning

Founder/Admin

Founder/Admin

Founder/Admin

Founder/Admin

Master Contributor

Master Contributor

Grand Contributor

Founder/Admin

Addicted to Fun and Learning

Addicted to Fun and Learning

Addicted to Fun and Learning

Addicted to Fun and Learning

Founder/Admin

Founder/Admin

Founder/Admin

Similar threads