AES Paper Digest: Sensitivity and Reliability of ABX Blind Testing

amirm · Mar 22, 2016

This is abbreviated synopsis of the paper, Ten Years of ABX Testing, by David Clark: http://www.aes.org/e-lib/browse.cfm?elib=5549

ABX testing as you may know is a double blind testing methodology aimed at determining if someone can reliably identify A from B. Both A and B are presented to the user together with "X" which is either A or B. Your job is to say which one it is. This is a "forced choice" test in that you must select one of the two answers (different from preference test where you could assign a fidelity number). Given this, you could easily pick A or B using random guesses. Statistical analysis is performed then to determine the likelihood of you doing this based on the number you got right.

Back to the paper, I am focusing on the second part of the paper given the fact that the topic has come up and there has been no references of it online that I know of. The first part is a resume for ABX testing recounting all the tests that have been performed.

Clark sets out to show the efficacy of ABX using two separate tests:

TESTING THE DOUBLE-BLIND TEST

The sensitivity of the A/B/X test can be tested by comparing it to a long-term listening session with infrequent switching and low stress. Audio magazine encouraged the present author and Lawrence L. Greenhill to undertake such a comparison in 1984. Unfortunately, the results were never published. The experiment used a fixed detection task of identifying whether or not the audio was passed through a nonlinear circuit which generated 2.5% total harmonic distortion on a sine wave. The nonlinearity used called "Grunge") generated a constant distortion, independent of sine wave amplitude or frequency over a wide range. High amounts of the effect produce an annoying "garbled" sound on complex program material. The circuit is described in reference [1].

Two groups of audiophiles were used as subjects. Lawrence Greenhill's Long Island based, The Audiophile Society (TAS) provided the high-end oriented "golden ears." David Clark's Southeastern Michigan Woofer and Tweeter Marching Society (SMWTMS) provided the "engineers." Two sets of tests were to be run with each group. The first test was a group double-blind test of 16 trials comparing the 2.5% distorted signal to a bypass. As it turned out, the TAS group refused to have the signal passed through the relays and connectors of the ABX Comparator. A manually-patched 16-trial pair-comparison test was used instead. They listened to a very expensive sound system which was familiar to most of them. The SMWTMS group used A/B/X testing and an unfamiliar sound system and room. They were given a one-hour familiarization period before the test began.

The second of the tests consisted of ten battery powered black boxes, five of which had the distortion circuit and five of which did not. The sealed boxes appeared identical and were built to golden ear standards with gold connectors, silver solder and buss-bar bypass wiring. Precautions were taken to prevent accidental or casual identification of the distortion by using the on/off switch or by letting the battery run down. The boxes were handed out in a double-blind manner to at least 16 members of each group with instructions to patch them into the tape loop of their home preamplifier for as long as they needed to decide whether the box was neutral or not. This was an attempt to duplicate the long-term listening evaluation favored by golden ears.

So summarizing a distortion generating device was created and handed out to two groups to identify against a dummy device, or pass through with no modification of audio signal as such. Testers were allowed to take the equipment home and evaluate on their own audio system. Testing was done both with quick switching in ABX versus long term evaluation using "take home" version of the same.

This was the outcome:

The results were that the Long Island group [Audiophile/Take Home Group] was unable to identify the distortion in either of their tests. SMWTMS's listeners also failed the "take home" test scoring 11 correct out of 18 which fails to be significant at the 5% confidence level. However, using the A/B/X test, the SMWTMS not only proved audibility of the distortion within 45 minutes, but they went on to correctly identify a lower amount. The A/B/X test was proven to be more sensitive than long-term listening for this task.

So the audiophile group failed to identify the correct box despite the gross amount of distortion that was inserted in the loop and extended evaluation time they had. The ABX believer group using a system they did not know managed in quick order tell the difference between the distortions inserted in the audio path versus not. Not only that, they were able to repeat that by detecting even smaller amount of distortion inserted in the path.

All of this matches my personal experience 100%. The ABX test group had the benefit of training and quick switching. Both of these improve the ability to hear small differences. In the countless tests of small differences I have passed in ABX testing, I would easily fail to do so if you made the test "long term."

Science of our hearing system backs this completely. Our hearing system has a short term and long term recall. Short-term recall is almost like a tape recorder, capturing everything from our ears. That is huge amount of data so the brain applies a massive, lossy filter to what is in short term memory, and commits what is left over to long term memory.

Short-term memory only lasts seconds and is re-written. As such you need to hear both of the stimulus in that short amount of time and analyze them before they fade away. Waiting longer means relying on long term memory which has no ability to remember fine details.

Training helps by optimizing usage of short-term memory by eliminating what doesn't matter.

Summary
All in all, the position of audio science on this matter is clear: fast AB switching is far more revealing than any long term tests. No evidence has ever been presented to show otherwise or to demonstrate anything based on psychoacoustics why that would be so.

Johnny2Bad · Feb 13, 2018

Dragging up an old thread here ...

RE:
" ... No evidence has ever been presented to show otherwise or to demonstrate anything based on psychoacoustics why that would be so. ..."

I think I understand what is meant by:

" ... No evidence has ever been presented to show otherwise ...".

I am not sure whether I understand what is meant by:

" ... No evidence has ever been presented ... to demonstrate anything based on psychoacoustics why that would be so. ..."

Amir, can you clarify?

amirm · Feb 13, 2018

It is a two part statement:

1. There is no report of any controlled testing that shows slow switching is better than fast.
2. There is no psychoacoustics basis for slow switching being better than fast.

Is this what you were asking?

Cosmik · Feb 13, 2018

So A/B/X identifies "difference". Don't the audiophiles claim that the long term thing is more about "preference"?

I can't get excited about people being able to hear "difference" because difference can be measured easily without using those messy, unreliable humans.

The implication is always that listening tests can settle questions of 'better', but when push comes to shove they never can. They can't even tell you whether high res is better than CD (maybe the higher frequencies cause IMD and sound worse), or whether vinyl performs some magic that makes things sound better than digital, or whether a DAC with glitches sounds more musical than one without.

amirm · Feb 13, 2018

Cosmik said:
So A/B/X identifies "difference". Don't the audiophiles claim that the long term thing is more about "preference"?

The problem is that in long term blind testing they cannot reliably tell the products apart. If so, then the preference expressed is not reliable either.

sergeauckland · Feb 13, 2018

amirm said:
This is abbreviated synopsis of the paper, Ten Years of ABX Testing, by David Clark: http://www.aes.org/e-lib/browse.cfm?elib=5549

Summary
All in all, the position of audio science on this matter is clear: fast AB switching is far more revealing than any long term tests. No evidence has ever been presented to show otherwise or to demonstrate anything based on psychoacoustics why that would be so.

This has always been my position. To the point that I find even ABX testing difficult, given that by the time I've compared BX, I've forgotten what AX was like.

My preferred testing is AA, AB, BB, BA, where the test is same/different. Then, rapid switching very quickly and clearly shows up even small differences between A and B. What it does require is very accurate level matching, as even comparing A with and identical B, if there's a small level difference, they show up as different. When I've done these tests, I've matched levels to 0.1dB, (about a needle's width on my meter) which I suggest is good enough.

S.

Cosmik · Feb 13, 2018

Such threads seem to me to be conspicuously light on the actual insights that were gained from the testing...

The methodology is a triumph within its own frame of reference, no doubt. Yes, A/B/X is more sensitive than many other protocols that ask middle aged men what they can hear. But is it more sensitive than a cheap USB module/microphone and some freeware, at detecting distortion or frequency response errors, or noise? If not, what's the point?

amirm · Feb 9, 2019

I forgot to post the link to ITU side for BS1116: https://www.itu.int/rec/R-REC-BS.1116/_page.print

JohnYang1997 · Feb 10, 2019

I don't feel the need of double blind test. A normal blinded ab test is more than enough to my experience.

amirm · Feb 10, 2019

JohnYang1997 said:
I don't feel the need of double blind test. A normal blinded ab test is more than enough to my experience.

That is what I do too in interest of time and effort.

flipflop · Feb 10, 2019

JohnYang1997 said:
I don't feel the need of double blind test. A normal blinded ab test is more than enough to my experience.

The word 'double' refers to the test subject(s) and the test conductor(s).
In a "normal blinded" test, only the subject would be blind and the conductor could consciously or subconsciously influence the subject.

mitchco · Feb 10, 2019

Concur with the summary, which has been my experience as well. I have spent my fair share of time with foobar2000 ABX comparator over the years and considering this ABX comparator for reviews...

If looking to try your ears/brain at this, Archimago currently has a fun test:
INTERNET BLIND TEST: Do digital audio players sound different? (Playing 16/44.1 music.)

DonH56 · Feb 11, 2019

mitchco said:
Concur with the summary, which has been my experience as well. I have spent my fair share of time with foobar2000 ABX comparator over the years and considering this ABX comparator for reviews...

If looking to try your ears/brain at this, Archimago currently has a fun test:
INTERNET BLIND TEST: Do digital audio players sound different? (Playing 16/44.1 music.)

I spoke with Frank ages ago as a snot-nosed kid... Great guy and not shy in his low tolerance for B.S. Really great to see he and his company are still around!

andreasmaaan · Feb 11, 2019

Cosmik said:
Such threads seem to me to be conspicuously light on the actual insights that were gained from the testing...

The methodology is a triumph within its own frame of reference, no doubt. Yes, A/B/X is more sensitive than many other protocols that ask middle aged men what they can hear. But is it more sensitive than a cheap USB module/microphone and some freeware, at detecting distortion or frequency response errors, or noise? If not, what's the point?

To establish at what thresholds those things that can be measured can be heard.

I’m sure you’d agree that nonlinear distortion say 200dB below the signal can’t be heard?

If so, what do you base that belief on?

What about 150dB? Or 100dB, or 80dB? At some point I’m sure you’ll say “that might be audible”.

If you you’re interested in whether something is audible, what’s the best way to find out?

Or are you just not interested in how audio actually sounds?

jhaider · Feb 11, 2019

amirm said:
This is abbreviated synopsis of the paper, Ten Years of ABX Testing, by David Clark: http://www.aes.org/e-lib/browse.cfm?elib=5549

David Clark passed away late last year. I never had a chance to communicate with him, but he's been an influence for sure. RIP

Cosmik · Feb 11, 2019

andreasmaaan said:
To establish at what thresholds those things that can be measured can be heard.

I’m sure you’d agree that nonlinear distortion say 200dB below the signal can’t be heard?

If so, what do you base that belief on?

What about 150dB? Or 100dB, or 80dB? At some point I’m sure you’ll say “that might be audible”.

If you you’re interested in whether something is audible, what’s the best way to find out?

Or are you just not interested in how audio actually sounds?

If you can test to find out what the thresholds are, you already have the hardware that exceeds the performance required to be transparent. So why not just use that hardware all the time, anyway?!

Theo · Feb 11, 2019

Cosmik said:
So A/B/X identifies "difference". Don't the audiophiles claim that the long term thing is more about "preference"?

Agreed. However, to be able to have a legitimate preference, you need to be able to identify a difference. So, if you can't tell which is which in an ABX test, your preference is BS :facepalm:

, unless you state that it has nothing to do with sound.

We both leave in a free country, so let them have a preference, whichever it is...

as long as they don't try to bias innocent bystanders

sergeauckland · Feb 11, 2019

Theo said:
Agreed. However, to be able to have a legitimate preference, you need to be able to identify a difference. So, if you can't tell which is which in an ABX test, your preference is BS, unless you state that it has nothing to do with sound. We both leave in a free country, so let them have a preference, whichever it is... as long as they don't try to bias innocent bystanders

Exactly this. We can all have preferences based on looks, how it makes us feel, bragging rights, or perhaps more 'objective' subjective measures such as serviceability, quality of construction or manufacturer's reputation. However, so many audiophiles claim their preferences are entirely based on sound, yet refuse to accept the validity of blind testing as they can't tell any difference blind, when it's obvious sighted, so the testing is wrong.

S.

andreasmaaan · Feb 11, 2019

Cosmik said:
If you can test to find out what the thresholds are, you already have the hardware that exceeds the performance required to be transparent. So why not just use that hardware all the time, anyway?!

Because optimum performance on one system may not be possible on another.

For example

a speaker at a distance of 1m is capable of lower distortion per decibel than a speaker at a distance of 10m or 100m
a speaker required to fit inside a home or studio isn't capable of as low distortion as a speaker 4x or 20x the size (or a pair of headphones)
a speaker which is not limited by the need for low latency is capable of superior phase response to a system of which low latency is required

Those are just the first three examples that come to mind.

And of course, because of cost.

And you still haven't answered my question: if I know that X is transparent, why use (better-measuring) Y?

Cosmik · Feb 12, 2019

andreasmaaan said:
Because optimum performance on one system may not be possible on another.

For example

a speaker at a distance of 1m is capable of lower distortion per decibel than a speaker at a distance of 10m or 100m

a speaker required to fit inside a home or studio isn't capable of as low distortion as a speaker 4x or 20x the size (or a pair of headphones)

a speaker which is not limited by the need for low latency is capable of superior phase response to a system of which low latency is required

Those are just the first three examples that come to mind.

And of course, because of cost.

And you still haven't answered my question: if I know that X is transparent, why use (better-measuring) Y?

Are you not just going to use the best or most appropriate technology you can obtain/design/afford in each circumstance? And by 'best' I mean objectively best - which you must be able to measure/characterise/specify in order to do the science, anyway.

Pretty soon I think we (as a family) are going to be getting rid of our old upright piano and buying a digital one. The manufacturer could try to tempt us by telling us that science has shown that the amplifiers within it are audibly transparent, but I don't think it would be a major selling point. What I expect is that the amplifiers will be more than adequate to the task, and that even if ten times the amount had been spent on them, they wouldn't sound any different in 99.9% of circumstances. The colour of the keys or the shape of the on-off switch will be of more significance to us than the amps.

I think that the amps for monitoring in studios have similar utilitarian requirements, as do the amps for PA.

Quad amps, designed by Peter Walker (mentioned earlier, who aimed for objective performance without even listening to the results) have been used in all these applications because they had the virtues of being stable, reliable, adjustment-free and 'high fidelity' - all of emerged from being designed to be objectively good without listening tests being involved.

At no point did he stop striving for the best distortion performance because of what some scientists were telling him. If it had been an issue of 1% distortion costing $1000 and 0.01% distortion costing $100,000 then I think you might have a point in some circumstances: there would be a cost tradeoff and you might have to decide where on that continuum you chose to be for your application. But as it is, he got performance better than the measuring equipment merely through the right configuration of some $0.10 resistors, inductors and capacitors.

Maybe if he had spent another $100,000 he could have halved the distortion again, and maybe science could show that in some unusual circumstances a human could hear it, but would there be much point?

AES Paper Digest: Sensitivity and Reliability of ABX Blind Testing

Founder/Admin

Member

Founder/Admin

Major Contributor

Founder/Admin

Major Contributor

Major Contributor

Founder/Admin

Master Contributor

Founder/Admin

Addicted to Fun and Learning

Addicted to Fun and Learning

Master Contributor

Master Contributor

Major Contributor

Major Contributor

Active Member

Major Contributor

Master Contributor

Major Contributor

Similar threads