
Double Blind Testing FAQ Development

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,376
Likes
234,546
Location
Seattle Area
OK, my fingers are tired. :) I hope others start to contribute and we have a starting point....

I will be super disappointed if this work does not conclude and members don't contribute to it.

OP is doing us a great favor by creating this doc so that we can reference it in the future rather than constantly having to rewrite parts of it in posts.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,267
Likes
4,759
Location
My kitchen or my listening room.
Hm, just saw this discussion (if I've seen it before, it was a long time ago).

As to "taking" an ABX or ABC/hr test (BS1116), listening for differences is the way to go. The whole "two things in the head" is bollocks.

Just listen for differences between A and X or B and X. The one that ISN'T DIFFERENT is the answer.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,267
Likes
4,759
Location
My kitchen or my listening room.
Wow. I would have thought some people on ASR would have wanted to contribute their thoughts given how much people are arguing for blind testing.


Many of us are very, very tired of being vilified in the press, on the net, etc., by people who make money by running sighted tests, or just plain bad tests.

Even people who run DBTs or ABX tests (computer-administered) often do not run things like control trials. Such things (both negative and positive controls) are an absolute necessity.
 

audio2design

Major Contributor
Joined
Nov 29, 2020
Messages
1,769
Likes
1,830
For those not familiar: BS1116 Download

BS1116 adds a 1-5 rating to ABC testing, as opposed to a hard yes/no. One of B and C will always be the same as A, and one different, just like ABX; but in ABX the "hidden reference" comes at the end, while in ABC the hidden reference is always A. However, in a typical ABC the hidden reference does not change, while in ABX the hidden reference can and does change.
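For anyone who wants to see the difference concretely, here is a rough sketch of the two trial structures. The function and field names are illustrative only, and whether the hidden reference sits in a fixed slot or moves per trial is an implementation choice, as noted above.

Python:
import random

def abx_trial(a, b):
    """ABX: A and B are both known to the listener; X is randomly one of
    them, and the task is to say which."""
    x_is_a = random.random() < 0.5
    return {"A": a, "B": b, "X": a if x_is_a else b,
            "correct_answer": "A" if x_is_a else "B"}

def abc_hr_trial(reference, impaired, hidden_position="B"):
    """ABC/hr (BS.1116-style): A is the declared reference; one of B and C
    is the hidden reference (identical to A) and the other is the impaired
    signal. The listener grades both B and C on the 5-grade impairment
    scale, and the hidden reference should come out at 5.0."""
    b, c = (reference, impaired) if hidden_position == "B" else (impaired, reference)
    return {"A": reference, "B": b, "C": c, "hidden_reference": hidden_position}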
 
Last edited:
OP

CMOT

Active Member
Joined
Feb 21, 2021
Messages
147
Likes
114
While the presentation of ABX encourages the listener to run the test this way, this is not a requirement. Indeed I find that it makes it more difficult to pass the test. A degenerate version of ABX is what I call AX testing. You listen to A and then play the presented random one. Then all you have to decide is whether they are the same or not.

Prior to running the test I listen to both A and B and try to identify the key thing that is different. Once I have that identification, then I can run the test with just one of the known stimuli.

AX is pretty much equivalent to what I suggested - 2IFC - you play two things and decide whether they are the same or different (and you could ask for preference as well). The AX version you suggest, where you listen to A and then play a random sample, is also a bit like a match-to-sample paradigm, where you familiarize yourself with/learn A, then listen to a string of Xs, deciding in each case whether each X matches A. But I like that less because the Xs can affect one another. So 2IFC - where you play two samples and decide on same or different - is the cleanest, I think. And just make sure you cover all possible combinations: AA, AB, BA, BB.
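A balanced trial list for that kind of same/different test is easy to generate. A rough sketch (the counts and names are only placeholders):

Python:
import random

def make_same_different_trials(n_per_condition=10):
    """Build a balanced 2IFC same/different trial list covering all four
    orderings: AA, AB, BA, BB."""
    conditions = [("A", "A"), ("A", "B"), ("B", "A"), ("B", "B")]
    trials = [pair for pair in conditions for _ in range(n_per_condition)]
    random.shuffle(trials)
    return trials

for first, second in make_same_different_trials(5):
    correct = "same" if first == second else "different"
    # present `first`, then `second`, collect the listener's response,
    # and score it against `correct`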
 
OP

CMOT

Active Member
Joined
Feb 21, 2021
Messages
147
Likes
114
One key aspect of the test is the statistical analysis as mentioned in the link in my article. So I would phrase it this way in the FAQ:

"In any forced choice of A or B, the listener can simply guess. Just like a flip of coin, if we ran the test just once, he would have 50% chance of getting lucky and guessing correctly. To get around this, enough number of trials needs to be conducted so that the chance of guessing would be small. The standard threshold for audio research is probability of guessing of 5% which is usually expressed as "p < 0.05." If you ran the test 10 times, you would need to get 9 right to achieve this metric. [We should pick the number of trails here. I suggest 15]. See https://www.audiosciencereview.com/forum/index.php?threads/statistics-of-abx-testing.170/

Note that there is nothing magical about 5% and indeed if the claim is easy of detection, one may want to target even smaller threshold to give higher confidence of results. p < 0.01 (1%) would be one such target."

Exactly. Appropriate statistical tests are needed to ensure any measured difference over trials isn't due to random guessing/luck. I also like d-prime for this reason: https://www.cns.nyu.edu/~david/handouts/sdt/sdt.html

Green and Swets applied this directly to auditory signals. It is still the standard.
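To make the arithmetic in the quoted draft concrete, here is a minimal sketch of the binomial check. The 9-of-10 figure and the suggested 15 trials come from the draft above; everything else is just illustrative scaffolding.

Python:
from math import comb

def p_guessing(correct, trials, p_chance=0.5):
    """One-sided binomial p-value: probability of getting at least `correct`
    answers right out of `trials` by guessing alone."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) * p_chance ** trials

def required_correct(trials, alpha=0.05, p_chance=0.5):
    """Smallest number of correct answers out of `trials` that achieves p < alpha."""
    for k in range(trials + 1):
        if p_guessing(k, trials, p_chance) < alpha:
            return k

print(p_guessing(9, 10))     # ~0.011, so 9 of 10 clears p < 0.05
print(required_correct(15))  # 12 correct needed out of 15 trials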
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,267
Likes
4,759
Location
My kitchen or my listening room.
Ok. Some things required for a valid ABX or ABC/hr test. (The "same/different" test is no better, btw, in practice.)

1) Both signals must be same level, to within .1dB
2) Both signals must be precisely time aligned. 1 sample at 44100 is barely close enough.
3) The "blinding" must be either double (experimenter and subject) or computer-administered with carefully written programming so that you don't get cues from the computer.
4) You must be able to loop and switch at will, with minimal (10 millisecond) delay.
5) Switching must be clickless and continuous. For digital signals, windowing must be used. For analog signals, it's harder, but you have to find a way that is clickless. Looping must avoid clicks and crunches at the loop point. Again windowing.
6) Both negative (A and B the same) controls, and positive (A and B SHOULD be detectable) controls are absolutely required.
7) There is no time limit.
8) 10 trials is about it for fatigue without a rest period of at least as long as the 10 trials took.
9) Training, first with easy signals, then with harder signals, then with the trial signals, is necessary. During training, feedback must be supplied (right/wrong)
10) It is often useful to use the full ABC/hr paradigm, where you rate the "different" signal on the ITU difference scale.

Yes, this is a pain in the behind.
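For digital source files, items 1 and 2 at least can be checked programmatically before anyone listens. A rough sketch, assuming two mono numpy arrays at the same sample rate (the thresholds are the ones given in the list above):

Python:
import numpy as np

def rms_db(x):
    """RMS level in dB (relative, so only the difference between files matters)."""
    x = np.asarray(x, dtype=np.float64)
    return 20 * np.log10(np.sqrt(np.mean(x * x)))

def check_level_match(a, b, tol_db=0.1):
    """Item 1: overall levels must match to within ~0.1 dB."""
    diff = abs(rms_db(a) - rms_db(b))
    return diff <= tol_db, diff

def check_time_alignment(a, b, tol_samples=1):
    """Item 2: estimate the lag between the files via cross-correlation;
    at 44.1 kHz even a one-sample offset is borderline."""
    n = min(len(a), len(b))
    corr = np.correlate(a[:n], b[:n], mode="full")
    lag = int(np.argmax(corr)) - (n - 1)
    return abs(lag) <= tol_samples, lag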
 

SIY

Grand Contributor
Technical Expert
Joined
Apr 6, 2018
Messages
10,386
Likes
24,752
Location
Alfred, NY
Ok. Some things required for a valid ABX or ABC/hr test. (The "same/different" test is no better, btw, in practice.)

1) Both signals must be same level, to within .1dB
2) Both signals must be precisely time aligned. 1 sample at 44100 is barely close enough.
3) The "blinding" must be either double (experimenter and subject) or computer-administered with carefully written programming so that you don't get cues from the computer.
4) You must be able to loop and switch at will, with minimal (10 millisecond) delay.
5) Switching must be clickless and continuous. For digital signals, windowing must be used. For analog signals, it's harder, but you have to find a way that is clickless. Looping must avoid clicks and crunches at the loop point. Again windowing.
6) Both negative (A and B the same) controls, and positive (A and B SHOULD be detectable) controls are absolutely required.
7) There is no time limit.
8) 10 trials is about it for fatigue without a rest period of at least as long as the 10 trials took.
9) Training, first with easy signals, then with harder signals, then with the trial signals, is necessary. During training, feedback must be supplied (right/wrong)
10) It is often useful to use the full ABC/hr paradigm, where you rate the "different" signal on the ITU difference scale.

Yes, this is a pain in the behind.

Let me add something trivial, yet... so often abused: Define in advance the exact question that the experiment is designed to answer. Define in advance the criteria for (in the vernacular) pass or fail. Do not post hoc shift the question or deal with post hoc complaints that the experiment didn't measure something not specified in the question to be answered.

I take some issue with #6. Positive and negative controls are usually necessary, especially in the psychoacoustic research you've done, and for that matter, in all serious psychoacoustic research. But there are some experiments I have been involved with in the past where positive controls were inappropriate (e.g., testing "magic," where you can't have something known to be above audible threshold because... that crap isn't actually audible). For example, what's the positive control for the presence or absence of a magic ray generator (I swear, that's a real thing)? The counterargument I sometimes hear to that is, "Well, you can put in a ringer with, say, a 0.4dB (or whatever) level change." My issue with that is that it's a determinant of the test's sensitivity to a different phenomenon than the one under test and hence not truly a positive control. I have, of course, implicitly admitted that testing the claims of charlatans is not "serious psychoacoustic research," but it's something that needs to be done.

Magic is often easier to deal with using sorting, but that's a story for a different day. Sorting tests can be a VERY powerful tool.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,267
Likes
4,759
Location
My kitchen or my listening room.
Let me add something trivial, yet... so often abused: Define in advance the exact question that the experiment is designed to answer. Define in advance the criteria for (in the vernacular) pass or fail. Do not post hoc shift the question or deal with post hoc complaints that the experiment didn't measure something not specified in the question to be answered.

I take some issue with #6. Positive and negative controls are usually necessary, especially in the psychoacoustic research you've done, and for that matter, in all serious psychoacoustic research. But there are some experiments I have been involved with in the past where positive controls were inappropriate (e.g., testing "magic," where you can't have something known to be above audible threshold because... that crap isn't actually audible). For example, what's the positive control for the presence or absence of a magic ray generator (I swear, that's a real thing)? The counterargument I sometimes hear to that is, "Well, you can put in a ringer with, say, a 0.4dB (or whatever) level change." My issue with that is that it's a determinant of the test's sensitivity to a different phenomenon than the one under test and hence not truly a positive control. I have, of course, implicitly admitted that testing the claims of charlatans is not "serious psychoacoustic research," but it's something that needs to be done.

Magic is often easier to deal with using sorting, but that's a story for a different day. Sorting tests can be a VERY powerful tool.

If you can't find a positive control with the units you're using, use something else. Make ***sure*** you set a minimum sensitivity. You must.
 
OP

CMOT

Active Member
Joined
Feb 21, 2021
Messages
147
Likes
114
To defend #6, control doesn't mean AUDIBLE control, it means that there are conditions where the same exact signal ("AA"; in terms of the audio chain producing the signal) is played twice and conditions where two different signals ("AB"; in terms of the audio chain producing the signals) are played sequentially. Now your experiment might not detect any ability of listeners to distinguish the AA vs AB conditions and that means you can't reject the null hypothesis. That could be for several different reasons. It could be there is no detectable difference between A and B because there isn't a difference (at least within the bounds of human hearing). It could be that there is no detectable difference between A and B because there is sufficient noise in the experiment (from whatever noise sources you like) that any difference between A and B is swamped by the noise (this is a major issue in fMRI studies). It could be that there is no detectable difference between A and B because the experiment is underpowered (e.g., insufficient trials).

You are arguing something that is a variant of Pauli's point:

This isn't right. It's not even wrong.
— Wolfgang Pauli

But really Shermer's version of it:

In science, if an idea is not falsifiable, it is not that it is wrong, it is that we cannot determine if it is wrong, and thus it is not even wrong.
— Michael Shermer, Wronger Than Wrong in Scientific American, Nov 2006

One can't falsify "magic". But here there is always going to be something to falsify - that two different devices that serve the same function (cable, amp, etc.) will sound detectably different from one another under controlled conditions. To argue that two different physical devices can't sound different and therefore the claim isn't falsifiable seems biased - it prejudges whether we should even run the tests. And to be fair, since these devices typically measure at least slightly differently, there is some possibility that they could sound different from one another. At least we should test that.
 
OP

CMOT

Active Member
Joined
Feb 21, 2021
Messages
147
Likes
114
If you can't find a positive control with the units you're using, use something else. Make ***sure*** you set a minimum sensitivity. You must.

Yep. This is what I mean by the second possibility - there should be a low enough noise floor (and I don't just mean noise from the systems, but across the entire experiment - environmental noise, participant noise, etc.) that a difference should be detectable if present. One can use a staircase method or something like that to determine what the minimum detectable difference is in a given set of conditions. Otherwise, if you get a null effect (no detected differences) you can't know that the differences aren't there - it could just be that they weren't detectable at the noise levels in your experiment.
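A minimal sketch of the staircase idea, using a 2-down/1-up rule (which converges on roughly the 70.7%-correct point); the starting level, step size, and trial count here are placeholders:

Python:
def two_down_one_up(respond, start_db=2.0, step_db=0.25, floor_db=0.0, max_trials=60):
    """Adaptive staircase: the level difference is reduced after two
    consecutive correct responses and increased after each error.
    `respond(level_db)` runs one trial and returns True if the listener was
    correct. The average of the reversal levels is a common threshold estimate."""
    level, streak, last_direction = start_db, 0, None
    reversals = []
    for _ in range(max_trials):
        if respond(level):
            streak += 1
            if streak < 2:
                continue            # no level change yet
            streak, direction = 0, "down"
            level = max(floor_db, level - step_db)
        else:
            streak, direction = 0, "up"
            level += step_db
        if last_direction and direction != last_direction:
            reversals.append(level)
        last_direction = direction
    return reversals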
 

SIY

Grand Contributor
Technical Expert
Joined
Apr 6, 2018
Messages
10,386
Likes
24,752
Location
Alfred, NY
To defend #6, control doesn't mean AUDIBLE control, it means that there are conditions where the same exact signal ("AA"; in terms of the audio chain producing the signal) is played twice and conditions where two different signals ("AB"; in terms of the audio chain producing the signals) are played sequentially. Now your experiment might not detect any ability of listeners to distinguish the AA vs AB conditions and that means you can't reject the null hypothesis. That could be for several different reasons. It could be there is no detectable difference between A and B because there isn't a difference (at least within the bounds of human hearing). It could be that there is no detectable difference between A and B because there is sufficient noise in the experiment (from whatever noise sources you like) that any difference between A and B is swamped by the noise (this is a major issue in fMRI studies). It could be that there is no detectable difference between A and B because the experiment is underpowered (e.g., insufficient trials).

You are arguing something that is a variant of Pauli's point:

This isn't right. It's not even wrong.
— Wolfgang Pauli

But really Shermer's version of it:

In science, if an idea is not falsifiable, it is not that it is wrong, it is that we cannot determine if it is wrong, and thus it is not even wrong.
— Michael Shermer, Wronger Than Wrong in Scientific American, Nov 2006

One can't falsify "magic". But here there is always going to be something to falsify - that two different devices that serve the same function (cable, amp, etc.) will sound detectably different from one another under controlled conditions. To argue that two different physical devices can't sound different and therefore the claim isn't falsifiable seems biased - it prejudges whether we should even run the tests. And to be fair, since these devices typically measure at least slightly differently, there is some possibility that they could sound different from one another. At least we should test that.
A positive control is necessary where the phenomenon is known to be real. If this is not the case, then any choice of positive control will by definition be a different phenomenon than that under test. For example, in my previous life, I researched the presence of xenoestrogens in plastic leachates by testing the effects on estrogen receptors. Negative control was easy, something that contained no xenoestrogen. Positive control was easy, something containing a known xenoestrogen, BPA. But if I used (say) a pH shift to do positive controls rather than a known xenoestrogen, I would have difficulty justifying that my test had adequate sensitivity for detecting xenoestrogens- it only tells me my test's sensitivity to pH shifts, NOT the question being asked.

What positive control would be used in testing the magic ray generator with claimed audible effects? There IS no known ray level to test the experimental sensitivity. My control protocol would therefore be quite different and appropriate to the specific question being asked. Yes, different than the control protocols one would use to determine, say, the audibility of crossover distortion.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,267
Likes
4,759
Location
My kitchen or my listening room.
What positive control would be used in testing the magic ray generator with claimed audible effects? There IS no known ray level to test the experimental sensitivity. My control protocol would therefore be quite different and appropriate to the specific question being asked. Yes, different than the control protocols one would use to determine, say, the audibility of crossover distortion.

You have several options. A level difference of 0.2 dB or so should show up. Small frequency shaping, an added noise floor - there are lots of things you could do.

Of course they are not the same as a fantasy do-nothing, but there's nothing to be done about that. Show that the test has sensitivity.
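Those kinds of ringers are trivial to generate digitally. A hedged sketch: the 0.2 dB figure comes from the post above, while the noise-floor level is just an example and should be tuned to the specific test.

Python:
import numpy as np

def level_ringer(x, gain_db=0.2):
    """Positive control: a copy of the signal with a small level offset."""
    return np.asarray(x, dtype=np.float64) * 10 ** (gain_db / 20)

def noise_floor_ringer(x, noise_db=-80.0, seed=0):
    """Positive control: a copy of the signal with an added flat noise floor,
    at `noise_db` relative to full scale."""
    x = np.asarray(x, dtype=np.float64)
    noise = np.random.default_rng(seed).standard_normal(len(x)) * 10 ** (noise_db / 20)
    return x + noise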
 
OP

CMOT

Active Member
Joined
Feb 21, 2021
Messages
147
Likes
114
To reiterate what J_J says, one just needs to demonstrate that differences are detectable under your experimental design. Otherwise the entire thing could be cooked to never find a difference (or accused of being cooked). I was once involved in a long-running debate where the other group predicted and found null effects (bad). I ended up having to explore why their experimental design couldn't detect any effects - it was going to be null no matter what. We don't want that. So we need to have as part of the FAQ: "Demonstrate your experimental setup/design has sensitivity and that differences are detectable by listeners when they SHOULD be detectable" (e.g. an intentional level difference). Find the minimal value that every listener can detect. Because of differences in hearing due to noise exposure, age, ear wax, etc., auditory researchers typically start each experiment by calibrating levels to ensure that each participant is set - for the positive control - at the same level of detectability. Then that level is used for the rest of the experiment.
 

SIY

Grand Contributor
Technical Expert
Joined
Apr 6, 2018
Messages
10,386
Likes
24,752
Location
Alfred, NY
You have several options. A level difference of 0.2 dB or so should show up. Small frequency shaping, an added noise floor - there are lots of things you could do.

Of course they are not the same as a fantasy do-nothing, but there's nothing to be done about that. Show that the test has sensitivity.

But that's exactly the problem I pointed out- the control varies something that is not being tested. Sensitivity to Variable A is not interchangeable with sensitivity to Variable B.
 

solderdude

Grand Contributor
Joined
Jul 21, 2018
Messages
15,891
Likes
35,912
Location
The Neitherlands
Setting up a proper blind test is a LOT of work. Especially when testing digital components it may be hard to match levels. Also channel imbalance should be looked at. Pots can easily vary 0.1dB within the 'normal usable range'.

If one's life, reputation, or a lot of money depends on it, I think the criteria are more critical than for, say, someone trying to find out which device is 'better' (preferred or objectively better).

What is often asked here, when someone makes a claim, is whether proper controls were used. The answer often is that the test was done blinded.
The real problem is how 'blinded' it was and how the test was done.

The moment one has to prove their abilities (be they special or not) is where things get ugly, because let's face it....
Applying the proper controls (to prove something) is complex, not easy to do, and above all requires verification of the test itself as well as enough time. I usually get 'tired' after a while when listening with full attention.

Those that actually have done proper blind tests (not many I reckon) will know how much effort it takes and what challenges there are in a practical sense.

Most that make certain claims can't be bothered or have no clue how to use proper controls and certainly do not have the needed gear/people.
This means that while 'we' are right to ask for the controls used, you can't expect these folks to apply all of the needed controls correctly just so 'they' can prove to 'us' that they have a preference that may be (mostly is) caused by bias. They just have a preference and defend it.

I am 100% convinced that if I were to do a blind test 'live' on camera to prove something to the world, I could easily rig the test without anyone knowing it.

Maybe... just maybe the best way after all is the 'uh-huh' in reply. The ONLY way to find out 'the truth' is to have the test properly done and witnessed by someone who knows how to properly test and can verify the test. All other ways cannot be used as definitive proof.

To properly test DAC's, pre-amps, amplifiers, headphones, speakers you need different setups and test gear anyway that may or may not have to be changed in certain circumstances.

Those that have actually done proper blind testing know how difficult or easy it can be and what's involved. Those making claims of audibility often are clueless and cannot possibly be expected to know what controls need to be used in what circumstances.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,267
Likes
4,759
Location
My kitchen or my listening room.
But that's exactly the problem I pointed out- the control varies something that is not being tested. Sensitivity to Variable A is not interchangeable with sensitivity to Variable B.

No, but there are known limits to hearing. Show you can get near them.
 
OP

CMOT

Active Member
Joined
Feb 21, 2021
Messages
147
Likes
114
But that's exactly the problem I pointed out- the control varies something that is not being tested. Sensitivity to Variable A is not interchangeable with sensitivity to Variable B.

I think there is a misunderstanding in the use of the term "control". There are two types of controls in many perceptual experiments, so let's separate them.

First, one needs to demonstrate test sensitivity - the ability to detect an effect if there is an effect. In some cases the conditions are so obviously different one need not do this first. Moreover, almost all published scientific results are positive effects - where a difference was detected - because scientific journals/scientists do not like null effects (failure to detect a difference). This is part of what is sometimes referred to as the "file drawer effect" - that there are many null results that never see the light of day. So if experiments are biased towards finding an effect (any effect!) then establishing test sensitivity isn't a big deal. But in hearing research (or lower-level vision research) one needs to establish that the observer can actually detect a difference in the paradigm one is using. So one might run a calibration block of trials to establish this - for example, establishing the lowest amplitude at which that individual can detect a difference between two signals with small amplitude differences. Due to differences in hearing (again, age, noise exposure, ear wax, etc.) one person might be able to detect differences at a low level (say a young person) but others might be unable to detect a difference at the same level (say old people). So Amir could set levels for an experiment with his son listening, then listen himself and bring in his buddies, and none of them might hear a difference - not because there might not be a difference, but because these old guys can't detect differences at the same low level his son can hear them! So run a "control" by establishing the threshold for each observer.

The second kind of control is to ensure that observers aren't biased/guessing etc. For example, if one is interested in whether there are detectable differences between A and B, one should run AB and BA trials, but also AA and BB trials. If the observer claims to detect differences for AA and BB (false alarms), then this is taken into account in the analyses (again d-prime and inferential statistics). One is controlling for certain behaviors of one's participants.

But there are lots of other "controls" one must include. For example, ensuring the conditions A and B are matched in every way but for the specific property being tested (e.g., ensure they produce the same loudness at the listener, that one doesn't have clicks while the other does, etc.). So the word control is a bit of an omnibus term. Control for listener differences (calibration as above), control for listener behavior, and so on. Most of these controls ARE NOT the dimension being tested - but they could influence the outcome if they were unintentionally left uncontrolled.
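As a rough sketch of how the AA/BB trials feed the analysis, here is one way to tally a same/different session into hit and false-alarm rates and a d' value. The log-linear correction and the simple yes/no d' formula are common conventions rather than anything specified in this thread; strictly speaking, same/different designs call for their own model (see Green and Swets, or Macmillan and Creelman).

Python:
from statistics import NormalDist

def score_same_different(results):
    """`results` is a list of ((first, second), response) tuples, e.g.
    (("A", "B"), "different"). 'Different' responses on AB/BA trials are
    hits; 'different' responses on AA/BB trials are false alarms."""
    hits = misses = false_alarms = correct_rejections = 0
    for (first, second), response in results:
        said_different = (response == "different")
        if first != second:                      # AB or BA trial
            hits += said_different
            misses += not said_different
        else:                                    # AA or BB trial
            false_alarms += said_different
            correct_rejections += not said_different
    # log-linear correction keeps z() finite when a rate would be 0 or 1
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return hit_rate, fa_rate, z(hit_rate) - z(fa_rate)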
 

solderdude

Grand Contributor
Joined
Jul 21, 2018
Messages
15,891
Likes
35,912
Location
The Neitherlands
Does it really matter what the hearing limits are and how they compare to others when the goal is merely to check whether someone can hear a difference that is relevant to them?
In the end the results will show whether the differences (they are always there unless AA or BB) are detectable for that person. It does not matter if person B can or cannot ace the test. That only matters if some absolute test is done where one wants to determine hearing limits for specific kinds of (isolated) changes.

Would XY trials, where both X and Y can be A or B, be preferred, or a mix of ABX + BAX?
Combined with a statistically sufficient number of attempts, is that the answer for the test sequence?

I am more interested in HOW the level matching and the tests are going to be checked.
I would suggest a good ADC with minimum requirements (24/96 or 24/192).
When absolute voltages have to be measured, what is the cheapest possible AC voltmeter with sufficient ranges and frequency response?

Also needed are procedures for testing various devices (MC/MM amps, RIAA amps, RtoR, pre-amplifiers, headphone amplifiers and their loads, speaker amps, speaker loads), the levels used, dummy loads, and test signals (digital and analog, steady state / music - which music? which recordings?).
What test cables, switch boxes, and software are to be used? DIY or off-the-shelf test equipment, and to what specifications?
 

pozz

Слава Україні
Forum Donor
Editor
Joined
May 21, 2019
Messages
4,036
Likes
6,827
Does it really matter what the hearing limits are and how they compare to others when the goal is merely to check whether someone can hear a difference that is relevant to them?
In the end the results will show whether the differences (they are always there unless AA or BB) are detectable for that person. It does not matter if person B can or cannot ace the test. That only matters if some absolute test is done where one wants to determine hearing limits for specific kinds of (isolated) changes.
Say there is a control portion of the test which takes A vs. A at a different level: first +2 dB with a broadband signal, then decreasing the difference until the subject fails or votes "same". If A vs. B are level matched to 0.1 dB and the difference between them is less than that, then the control will have established why the listener failed the trial, or, if the listener passed, that it was not an amplitude difference which the listener heard.
 