
Double Blind Testing FAQ Development

CMOT

Active Member
Joined
Feb 21, 2021
Messages
147
Likes
114
Hi all, in a now shut down thread (for other reasons), I suggested that it might be nice to develop a community FAQ on standards for double blind testing of components. That is, a guide that lays out some basic principles (e.g., same signal path EXCEPT for the two components being compared, both the rater and the "switcher" blind to which component is currently active, etc.) that we can point to. It might also cover the number of trials, appropriate statistical tests, etc. I thought I would start this thread to gather suggestions. One hopes (maybe not?) that we can converge on a set of mutually agreed conditions that can then be synthesized into a FAQ.

Amir did point me to the "bible of blind testing," ITU BS.1116: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1116-1-199710-S!!PDF-E.pdf However, it is pretty cumbersome and runs 20+ pages. I am hoping we could distill the critical principles into a page or two.

So, I'm open to suggestions for principles to consider including in the FAQ.
 

CMOT

Active Member
Joined
Feb 21, 2021
Messages
147
Likes
114
Wow. I would have thought some people on ASR would have wanted to contribute their thoughts, given how strongly people here argue for blind testing. Posters are ripped when they report differences without using double blind testing. Isn't the least we can do to develop some simple guidelines we consider acceptable?

And to be clear (and maybe to start things off), I am not a fan of ABX testing - it requires holding two items in auditory memory (or whatever modality is being tested) for comparison against the target. I prefer 2IFC (two-interval, forced choice), where the "choice" can just be same or different. One can then run conditions A vs. B with trial structures AB, BA, AA, BB. It is easy to compute hits (correctly identifying a trial as AB or BA) vs. false alarms (reporting a difference for AA or BB), and then compute d-prime and bias as well as significance tests, as sketched below. The trial design is also cleaner than ABX, where you really should run AB[A|B], BA[A|B], AA[A|B], BB[A|B] - twice as many trial types.
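For illustration only, here is a minimal Python sketch of that analysis, assuming SciPy is available; the trial and response counts below are made-up placeholders, and the significance check is simply a one-sided binomial test on total correct responses:

```python
# Hypothetical counts for a same/different design with AB, BA, AA, BB trials.
from scipy.stats import norm, binomtest

n_diff_trials = 20   # AB and BA trials presented (placeholder)
n_same_trials = 20   # AA and BB trials presented (placeholder)
hits = 16            # "different" responses on AB/BA trials
false_alarms = 5     # "different" responses on AA/BB trials

# Rates with a small correction so the z-scores stay finite at 0 or 1.
hit_rate = (hits + 0.5) / (n_diff_trials + 1)
fa_rate = (false_alarms + 0.5) / (n_same_trials + 1)

d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)            # sensitivity
bias_c = -0.5 * (norm.ppf(hit_rate) + norm.ppf(fa_rate))    # response bias

# One-sided binomial test: total correct responses vs. chance (50%).
correct = hits + (n_same_trials - false_alarms)
p = binomtest(correct, n_diff_trials + n_same_trials, 0.5, alternative="greater").pvalue
print(f"d' = {d_prime:.2f}, bias c = {bias_c:.2f}, p = {p:.4f}")
```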
 

DVDdoug

Major Contributor
Joined
May 27, 2021
Messages
2,916
Likes
3,831
Level matching is important (unless that's what you're listening for).

ABX only tells you if you can (statistically) hear a difference. If you hear a difference, the test doesn't tell you which one is "better".

HydrogenAudio has some information about ABX testing, but the link to the probability table seems to be broken and permanently gone. I haven't searched for an alternative and it's been too long since my statistics class. ;)

HydrogenAudio is sort of dedicated to blind listening tests, i.e., you are not allowed to post claims that "A is better than B" on the basis of measurements alone. First, you have to demonstrate that you can actually hear what you're claiming to hear.

Most of the HydrogenAudio information is about comparing different digital formats (different lossy codecs, lossy-to-lossless, HD to CD, etc.) where hardware switching isn't needed and it can all be done in software, with the software taking care of the randomizing and switching so you don't need two people to make it double-blind. The same thing could be accomplished with one person and an automated hardware ABX switching box, but if such a thing exists, I'm sure it's expensive.
 

CMOT

Active Member
Joined
Feb 21, 2021
Messages
147
Likes
114
Thanks! With 2IFC, one could collect preferences as well. There is a ton of double blind listening work on codecs - going back to the original MP3 format. I am hoping we could generate something more generic that helps facilitate careful testing of sources, amplifiers, preamps, DACs, etc. And yes, level matching is a critical part of any such design. Ideally we could explain that and provide a link to a low-cost sound meter solution so that people can ensure level matching in their own testing.
 

danadam

Addicted to Fun and Learning
Joined
Jan 20, 2017
Messages
956
Likes
1,496
HydrogenAudio has some information about ABX testing but the link to the probability table seems to be broken and seems to be permanently gone.
You mean that bino_dist.zip? Here: https://web.archive.org/web/20070101102152/http://www.kikeg.arrakis.es/winabx/bino_dist.zip . Also in attachment.
You could also use an online calculator instead, e.g. https://stattrek.com/online-calculator/binomial.aspx or https://homepage.divms.uiowa.edu/~mbognar/applets/bin.html , or compute it directly as sketched below.
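For those who prefer to skip the calculators, a minimal Python sketch of the same arithmetic (standard library only): the one-sided probability of getting at least k correct out of n trials by guessing alone.

```python
from math import comb

def abx_p_value(k_correct: int, n_trials: int) -> float:
    """Exact one-sided binomial probability of k_correct or more by pure guessing (p = 0.5)."""
    return sum(comb(n_trials, k) for k in range(k_correct, n_trials + 1)) / 2 ** n_trials

print(abx_p_value(9, 10))   # ~0.0107
print(abx_p_value(12, 16))  # ~0.0384
```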
 

Attachments

  • bino_dist.zip
    23.8 KB · Views: 80

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,368
Likes
234,384
Location
Seattle Area
And to be clear (and maybe start things off), I am not a fan of ABX testing - requires holding two items in auditory memory (or whatever modality is being tested) for comparison against the target.
While the presentation of ABX encourages the listener to run the test this way, it is not a requirement. Indeed, I find that running it that way makes it more difficult to pass the test. A degenerate version of ABX is what I call AX testing: you listen to A and then play the presented random one. Then all you have to decide is whether they are the same or not.

Prior to running the test, I listen to both A and B and try to identify the key thing that is different. Once I have that identification, I can run the test with just one of the known stimuli.
 

danadam

Addicted to Fun and Learning
Joined
Jan 20, 2017
Messages
956
Likes
1,496
I am not a fan of ABX testing - requires holding two items in auditory memory (or whatever modality is being tested) for comparison against the target.
A copy of my comment from https://www.audiosciencereview.com/forum/index.php?threads/abx-testing.21621/#post-718226 :

That sounds like the original procedure from 1950: Munson & Gardner, "Standardizing Auditory Tests," The Journal of the Acoustical Society of America.

Then there is a modification, apparently called "modern ABX", from 1982: Clark, "High-Resolution Subjective Testing Using a Double-Blind Comparator," Journal of the Audio Engineering Society. As far as I understand, it gives the participant control to play A, B, or X as they see fit.

Here's @j_j's comment from HA:
I have to point out that the "ABX" test of Munson was a sequential test, not a time-proximate test in which the subject had control over stimulus selection at all points in time. This test has been brought up from time to time, and various people have taken it upon themselves to scold the entire AES about how this sequential test is such a bad test (as it is, the lack of time proximity is massively desensitizing), ignoring the fact that this is NOT the modern test called "ABX".

The ABX test from the Michigan bunch was time-proximate. This is a substantial improvement, and is what is commonly referred to in the present day as the ABX test.
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,368
Likes
234,384
Location
Seattle Area
One key aspect of the test is the statistical analysis, as mentioned in the link in my article. So I would phrase it this way in the FAQ:

"In any forced choice of A or B, the listener can simply guess. Just like a coin flip, if we ran the test just once, he/she would have a 50% chance of getting lucky and guessing correctly. To get around this, enough trials need to be conducted that the chance of a string of lucky guesses becomes small. The standard threshold in audio research is a guessing probability of 5%, usually expressed as "p < 0.05." If you ran the test 10 times, you would need to get 9 right to achieve this metric. [We should pick the number of trials here. I suggest 15.] See https://www.audiosciencereview.com/forum/index.php?threads/statistics-of-abx-testing.170/

Note that there is nothing magical about 5%, and indeed if the claim is that the difference is easy to detect, one may want to target an even smaller threshold to gain higher confidence in the results. p < 0.01 (1%) would be one such target."
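As a rough illustration of where the "9 out of 10" figure comes from, here is a Python sketch that finds the smallest number of correct answers giving p < 0.05 for a few candidate trial counts (the 10/15/20 values are examples, not a recommendation):

```python
from math import comb

def min_correct(n_trials: int, alpha: float = 0.05):
    """Smallest k with P(k or more correct by guessing) < alpha, plus that probability."""
    for k in range(n_trials + 1):
        p = sum(comb(n_trials, i) for i in range(k, n_trials + 1)) / 2 ** n_trials
        if p < alpha:
            return k, p

for n in (10, 15, 20):
    k, p = min_correct(n)
    print(f"{n} trials: need {k} correct (p = {p:.4f})")
# 10 trials: need 9 correct (p = 0.0107)
# 15 trials: need 12 correct (p = 0.0176)
# 20 trials: need 15 correct (p = 0.0207)
```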
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,368
Likes
234,384
Location
Seattle Area
How about this general intro:

"The purpose of a blind test is to have a controlled evaluation where only the tester's hearing is the determinant factor, not prior knowledge of products under test, how they look, how much they cost, reputation, etc. For this reason, all non-auditory aspects need to be removed in addition to removing any signs of "tells" which would reveal the identity of devices being tested."
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,368
Likes
234,384
Location
Seattle Area
On using ABX testing:

"While audiophiles generally claim one device to sound better than the other, determining this can be challenging. A first pass-filter to verify or deny such a claim is to see if any difference is able to be detected at all. Should there be no difference, then the claim of audio improvement is also invalidated. If not, then secondary tests of preference needs to be run. The most common test of difference is ABX. Here, the listener is presented with both A and B samples and allowed as much time as necessary to become familiar with their audible difference (if any). Then the listener starts the test where he/she is presented with a random version of A or B called "X." His/her job is then to determine if "X" sounds the same as A or B. Once there, he/she selects the appropriate button for that choice and a new trial is started with another random assignment of X (which could be the same or different than last test."
 

somebodyelse

Major Contributor
Joined
Dec 5, 2018
Messages
3,682
Likes
2,959
It ought to start with a series of questions about why each of the controls is necessary, with a demonstration where possible, to help people past the "Why do I have to do this when the difference is so obvious?" barrier, or the idea that we may think they're dishonest rather than simply human.
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,368
Likes
234,384
Location
Seattle Area
On time limit:

"One misconception of blind testing is that there is a time limit to the testing which heightens the "stress" of passing such tests. This is not so. There is no clock in such tests. Tester can spend as little or as much time as he/she fees necessary to complete each round of the trial. While it has been shown that "long term" listening hinders rather than help such testing, it is still up to tester to choose a method to express the difference he/she hears."
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,368
Likes
234,384
Location
Seattle Area
On level matching:

"Matching how loud each sample plays is a key consideration in controlled testing. If levels are not matched, it becomes trivial to tell the products apart and defeat the nature of "blind" testing. In addition, louder samples tend to sound better to listeners even if the fidelity has not changed. For these reasons, matching levels is an extremely critical aspect of such testing.

For electronic products that have flat frequency response in audible band, such a job is easy as one can pick any frequency to measure and adjust the gain/volume of each device (or music file) to be the same. If a multi-meter is used for hardware level matching, care needs to be taken to pick a low enough frequency that is within the range of the capability of the device (usually below 1 kHz). Avoid frequencies that are close to mains (50 or 60 Hz). The target for matching levels is 0.1 dB across the full audible range although this is more strict than research tends to indicate.

If the frequency response of the device is not flat, e.g. in the case of speakers, headphones, tube amps with high impedance, etc. then the job becomes challenging. Usually pink noise with some averaging is used to determine the level. While this is accepted as the standard in research, it does reduce accuracy of said work [Amir's opinion :) ].

Matching of levels for speakers also requires picking a playback level. Standard in research is 85 dBSPL at listening position [please verify from ITU BS1116 paper] for a 2-channel system."
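A small sketch of the arithmetic behind the 0.1 dB target, assuming two RMS voltage readings taken with the same meter at the same test frequency (the readings below are placeholders):

```python
import math

v_device_a = 2.000   # volts RMS at the chosen test frequency (placeholder)
v_device_b = 1.985   # volts RMS for the other device (placeholder)

diff_db = 20 * math.log10(v_device_a / v_device_b)   # level difference in dB
print(f"Level difference: {diff_db:+.3f} dB")
if abs(diff_db) > 0.1:
    print("Outside the 0.1 dB target: adjust gain/volume and re-measure.")
```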
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,368
Likes
234,384
Location
Seattle Area
On switching apparatus:

"When testing two pieces of hardware against each other, some mechanism to select between them is needed. It is very important that such selection switch/relay mechanism be either absolutely silent or lack any "tells" such as a different sound when A is selected versus B.

In addition, care must be taken that equipment is not damaged by feeding output of one into the other, or cause inductive spikes (e.g. switching out an amp from a speaker). This can present a big challenge at times as such switching hardware is not readility available.

One technique used is to capture the output of the device into a file and then perform blind testing on that. To the extent this is done, then sampling of the signal should be done in the way the device is used. So if we are testing amplifiers, then the amplifier should be connected to its speaker load. The digital capture can then be prior to the speaker or post (using microphone).

Such virtualization of testing can result in some criticism of the methodology. Still, if the claim to be verified that the differences are big, then this mechanism an survive such restrictions. In other cases though such as assessing soundstage of speakers, the capture mechanism may be insufficient to present a credible test."
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,368
Likes
234,384
Location
Seattle Area
Maximization of Detection:

"It is important that the listener be given every chance to find audible differences that exist between devices. As such, prior training to become familiar with the test material/protocol is highly encouraged. As is feedback during the testing as to whether a detectable difference has been found."
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,368
Likes
234,384
Location
Seattle Area
Single vs Double Blind Testing:

"It goes without saying that any controlled testing "blinds" the listener so that he/she doesn't know the identity of items under test. This is called single blind testing. To the extent there is another person/computer is used to switch the samples, ideally such mechanism is also blind in that it doesn't know the identity of what is being tested either. This is called double blind testing. With a computer program generating random file samples to play, double blind testing can be trivially achieved and must be the standard for such testing.

With respect to hardware switching, automated mechanisms to randomly select samples may not exist. To the extent a human is selected to do this job, he/she can intentionally and unintentionally telegraph signals to the listener that would distort the test. See "clever Hans:" https://simple.wikipedia.org/wiki/C... (in German, der,people who were watching him.

So ideally, the person doing the switching doesn't know the identity of item under test either. If this cannot be achieved, an approximation can be used by making absolutely sure that there is no communication whatsoever between this person ("proctor") and the person taking the test. "
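For file-based comparisons, a minimal sketch of how the computer can take on the "switcher" role; the two capture file names are placeholders, the anonymous copies are what the listener plays, and the key file is opened only after all answers are scored:

```python
import json
import random
import shutil

sources = ["device_A.wav", "device_B.wav"]   # placeholder capture files
random.shuffle(sources)                      # hidden assignment of identities

key = {}
for label, src in zip(["X", "Y"], sources):
    shutil.copyfile(src, f"sample_{label}.wav")  # anonymous copies for the listener
    key[label] = src

with open("key.json", "w") as f:
    json.dump(key, f)                        # reveal only after all trials are scored
```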
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,368
Likes
234,384
Location
Seattle Area
On importance of statistical analysis:

"Any test that lacks statistical analysis to demonstrate high chance of results being correct is of no value. Saying "I have run this test blind and I could tell which is which" is of no value. You could have easily just gotten lucky with a guess. Even repeating a test 3 or 4 times is insufficient. Per other item in this FAQ, the test needs to be run significant number of times to have value (minimum of 10 times)."
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,368
Likes
234,384
Location
Seattle Area
On proof of existence:

"As much as one to trust claims of passing controlled tests, trust by itself cannot be a factor that validates a test. Some clear evidence needs to be presented that such a test was run correctly and the conditions in this FAQ are met. ABX programs which provide signed results are good first steps. Saying that you have run such a test, is not. In extreme cases, the tests need to be conducted in front of others or lacking that, at least video evidence of the same may be necessary."

[this needs more work]
 