# Statistics of ABX Testing

#### amirm

Staff Member
CFO (Chief Fun Officer)
By Amir Majidimehr

ABX is by far the most talked-about type of listening test on the Internet for demonstrating an audible difference between two devices or two audio files. As the name implies, the test involves two inputs, "A" and "B," and the listener is asked to vote on whether X (a randomly chosen presentation of "A" or "B") is closer to "A" or to "B." People tend to think this is a rather new type of test, but its history dates back to a paper published in the Journal of the Acoustical Society of America circa 1950 (yes, 1950!) by two Bell Labs researchers, Munson and Gardner. Here is part of the abstract (http://scitation.aip.org/content/asa/journal/jasa/22/5/10.1121/1.1917190):

“The procedure, which we have called the “ABX” test, is a modification of the method of paired comparisons. An observer is presented with a time sequence of three signals for each judgment he is asked to make. During the first time interval he hears signal A, during the second, signal B, and finally signal X. His task is to indicate whether the sound heard during the X interval was more like that during the A interval or more like that during the B interval.“

The modern-day instantiation differs a bit in that the listener has control of the switching, but its main characteristic remains that of a "forced choice" listening test. Much like a multiple-choice exam in school, the listener can get the right answer in one of two ways: actually matching X to the right input ("A" or "B"), or guessing randomly and getting lucky. Unlike a school exam, however, we don't want to give credit for lucky guesses. We are interested in knowing whether the listener actually determined through listening that X matched one of the two inputs. If so, we know that the two inputs differed audibly.

Taken to the extreme, this is an impossible problem to solve. Take the scenario of a listener voting correctly 6 out of 10 times. Did he guess randomly and get lucky, or did he really hear the difference between "A" and "B" 6 out of 10 times? Either outcome is possible.

Let's note the other complexity, which is the fact that a human tester is fallible when it comes to audio testing. What if, for example, the tester correctly matched X to "A" but clicked on the wrong button and voted it as being the same as "B"? Having taken many such tests, I can tell you that this is a common occurrence. Blind tests rely on short-term memory and carefulness, neither of which can be guaranteed to be there during the entire test. This and other sources of imperfection in the test fixture and methodology mean some number of wrong answers needs to be allowed.

How much error is tolerable, then? The convention in industry and research is to accept an answer that has less than a 5% probability of being due to chance. Inverted, we are 95% sure the outcome was intentional and not lucky guessing. Why 5%? You may be surprised, but it is an arbitrary number that dates back to Fisher's 1925 book, Statistical Methods for Research Workers, where he proposed a one-in-20 probability of error (5%). Is there some magic there? How about one in 19 or 21? Common sense says those results should be just as good, but people stick to 95% as if it were a commandment from above.

Adding to the confusion, people don't realize that this probability of chance is computed from a discrete distribution and hence jumps in value. One fewer right answer may cause a jump from a 3% probability of chance to 8%; there are no intermediate values between them, because there is no way to get half a right answer or some other fraction. This makes it silly to target a value like exactly 5%, which may not be physically achievable, or to insist that 6% is not good enough when the next achievable value is 4%.
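These discrete jumps are easy to see by direct computation. The following is a small Python sketch of my own (not from any of the papers discussed); it sums the binomial tail to get the probability of scoring at least s out of n by pure guessing:

```python
from math import comb

def p_chance(s: int, n: int) -> float:
    """Probability of getting at least s of n trials right by guessing (p = 0.5)."""
    return sum(comb(n, k) for k in range(s, n + 1)) / 2 ** n

# For 10 trials, the achievable p-values jump in discrete steps:
for s in range(6, 11):
    print(f"{s}/10 correct: {p_chance(s, 10):.1%} probability of chance")
# 6/10 correct: 37.7% probability of chance
# 7/10 correct: 17.2% probability of chance
# 8/10 correct: 5.5% probability of chance
# 9/10 correct: 1.1% probability of chance
# 10/10 correct: 0.1% probability of chance
```

Note that for 10 trials there is no achievable value between 5.5% and 1.1%, so a hard 5% cutoff cannot be hit exactly.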

These issues have been the subject of much discussion in fields far wider than audio. Fisher himself opined in his 1956 book, Statistical Methods and Scientific Inference, that people should not dogmatically stick to the 5% value. His recommendation, which I agree with, is to look at the application at hand and let that determine the right threshold. In matters of life and death we may want a lower probability of chance than for something as mundane as audio. So don't be hung up on the 5% number like many people seem to be, online and even in published studies.

## Computing Significance of Results
Moving beyond the threshold, let's discuss how we compute the probability of chance itself. We take advantage of the fact that ABX testing has a "binomial" distribution: the listener either gets each answer right or wrong (hence the prefix "bi," for two outcomes). As you have seen so far, the more right answers the listener gets, the lower the probability that the outcome is due to chance alone. This picture from the excellent paper "Statistical Analysis of ABX Results Using Signal Detection Theory" shows this pictorially:

And the example from the above data:

"Fig. 2 shows that there is a particular chance involved in randomly getting s responses correct out of n trials. To be more confident that the result is not due to chance, more correct responses are needed. As an example, if there were 8 out of 10 correct responses from a Bernoulli experiment, there is a 5.5% chance that it was due to random guesses. The reader should also recognize that given random responses for 100 tests, 5 test results would show at least 8 out of 10 correct responses."

"Cumulative binomial distribution” tables (countless many are online) and calculators can be used to find the number of right answers to achieve 95% confidence out of total number of trials. A much easier method though is to use the Microsoft Excel function "binom.inv.” It uses the same three parameters we would use in those tables: the probability of each outcome which is 0.5, the criteria which is 0.95 (95% confidence), and the number of trials. The returned value is the number of right answers to achieve 5% probability of chance.

In the table below, I have computed the answer for 10, 20, 40, 80 and 160 trials:
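These thresholds fall straight out of the cumulative binomial distribution. Here is a short Python sketch of mine (not the author's spreadsheet) whose binom_inv mirrors what Excel's BINOM.INV computes:

```python
from math import comb

def binom_inv(trials: int, p: float, criterion: float) -> int:
    """Smallest k whose cumulative binomial probability meets the criterion
    (the same computation as Excel's BINOM.INV)."""
    cumulative = 0.0
    for k in range(trials + 1):
        cumulative += comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        if cumulative >= criterion:
            return k

for n in (10, 20, 40, 80, 160):
    k = binom_inv(n, 0.5, 0.95)
    print(f"{n:>4} trials: {k} right answers ({k / n:.2%})")
# Output:
#   10 trials: 8 right answers (80.00%)
#   20 trials: 14 right answers (70.00%)
#   40 trials: 25 right answers (62.50%)
#   80 trials: 47 right answers (58.75%)
#  160 trials: 90 right answers (56.25%)

print(binom_inv(160, 0.5, 0.99))  # 95 right answers for 99% confidence
```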

Notice something fascinating: as the number of trials increases, the percentage of right answers needed to achieve less than a 5% probability of chance shrinks. So much so that at 160 trials, you only need 56% of the answers to be right! Let me repeat: you only need 56% right answers to have high confidence that the results are not due to chance. At the risk of really blowing your mind, we need only 95 right answers out of 160 to achieve 99% confidence. That is only 59% right answers!

The non-intuitive nature of this statistical measure is what motivated this article. In 2014, the Audio Engineering Society paper "The Audibility of Typical Digital Audio Filters in a High-Fidelity Playback System" was published, in which the authors tested whether listeners could tell the difference between 24-bit/192 kHz music and versions converted to 44.1 or 48 kHz sampling at 24 and 16 bits. The paper stated that the threshold for achieving 95% statistical significance was 56% right answers. This set off an immediate reaction among online skeptics, who cried foul by saying 56% is not much better than the "50% probability of a lucky coin toss." Never mind that the paper in question was peer reviewed and won the award for best paper at the AES convention. That alone should have told these individuals they were barking up the wrong tree, but online arguments being what they are, folks rallied behind this mistaken notion.

The study used 160 trials, which is why I also included that number in the table above. As we have already discussed, we indeed need only 56% right answers to reduce the probability of chance to 5%. So the authors were quite correct, and it was the online pundits who were mistaken.

Another source of confusion was that people thought 56% right answers is what the test achieved. That too was wrong, and indicated people had not understood the test results as outlined in the paper:

The dashed line is the 95% confidence line (at 56% right answers). The vertical bars show the percentage of right answers in each of six tests (22050 corresponds to 44.1 kHz sampling and 24000 to 48 kHz). With the exception of one test, the mean in the others easily cleared the 95% confidence threshold. The authors justifiably state the same:

“The dotted line shows performance that is significantly different from chance at the p<0.05 level [5% probability of chance] calculated using the binomial distribution (56.25% correct comprising 160 trials combined across listeners for each condition).”

Now let's look at another report, one that online pundits have not questioned: the AES Journal engineering report "Audibility of a CD-Standard A/D/A Loop Inserted into High-Resolution Audio Playback," by Meyer and Moran. Here is a breakdown of some of the results:

"...audiophiles and/or working recording-studio engineers got 246 correct answers in 467 trials, for 52.7% correct.

Females got 18 in 48, for 37.5% correct.

...The “best” listener score, achieved one single time, was 8 for 10, still short of the desired 95% confidence level."

We notice the first mistake right away: reporting only the percentage right, not the statistical significance. Rightly or wrongly, the authors appear to want to sway the reader into thinking these are small percentages near the 50% "probability of chance," and hence that listeners failed to tell the difference between DVD-A/SACD and the 16-bit/44.1 kHz converted version. As we have been discussing, the percentage of right answers does not provide an intuitive feel for how confident we can be in the results. Statistical analysis should have been used, not percentages.

Now let's look at how the audiophiles and recording engineers did: 246 correct out of 467 trials. Converting this to statistical significance gives roughly 87% confidence that the results were not due to chance. A very different animal than "52.7%" right answers. Just a handful more right answers (252 of 467) would have given us 95% confidence! Is a roughly 13% probability that this outcome was due to chance too high? You can be the judge of that. A judgment you can only make when the statistical analysis is presented, not just a percentage of right answers.
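Converting a raw score into a probability of chance is a one-line binomial tail sum. This Python sketch is my own check of the 246-of-467 result, not a computation from the report:

```python
from math import comb

def p_chance(s: int, n: int) -> float:
    """One-sided probability of getting at least s of n trials right by guessing."""
    return sum(comb(n, k) for k in range(s, n + 1)) / 2 ** n

p = p_chance(246, 467)
print(f"246/467 correct: {p:.0%} probability of chance ({1 - p:.0%} confidence)")
# 246/467 correct: 13% probability of chance (87% confidence)
```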

The next bit is the statistic for female testers: 37.5% correct. To a layperson that may seem even more random than random. But nothing could be further from the truth. Imagine the extreme case of getting zero right answers. Does that mean the listeners could not distinguish the files? The answer is most definitely no. Indeed, it means a near-0% probability of the outcome being chance! Why did the tester get zero right? A simple explanation is that they may have read the voting instructions wrong and, while hearing the differences correctly, voted the other way. The authors of the previously mentioned paper, "Statistical Analysis of ABX Results Using Signal Detection Theory," give credence to the same factor:

"It is also important to consider the reversal effect. For example, consider a subject getting 1 out of 10 responses correct. According to the cumulative probability, it may be tempting to say that this outcome is very likely: 99.9% in this case. However, consider that your test subject has been able to incorrectly identify „X‟ 9 out of 10 times. The cumulative probability shows the likelihood of getting at least 1 out of 10 correct responses. The binomial probability graph shows that to get exactly 1 out of 10 correct responses has a probability of 0.97%. It is possible that the subject has reversed his/her decision criteria. Rather than identifying X=A as X=A, the subject has consistently identified X=A as X=B. The difference in the stimuli was audible regardless of the subjects possible misunderstanding of the ABX directions or labeling."

In that regard, a score far below chance is itself a significant result and worth investigating.

The last bit in the Meyer and Moran data summary is also puzzling: the one person who achieved 8 out of 10 right answers is said not to have met the confidence bar. As shown in the quote from the paper above, the probability of chance for 8 out of 10 is 5.5% as opposed to 5%. How could that small difference be enough to dismiss the outcome as not significant? Let's remember again that there is no magic in the 5% number. 8 out of 10 does give us high confidence that the results were not due to chance.

## Summary
Audio listening tests may seem simple, but there is much complexity in conducting them and analyzing their results. That we each have a pair of ears does not qualify us to run them or interpret their outcomes. The statistical analysis of ABX results is one such aspect. It seems that the authors of tests like Meyer and Moran's farmed out this work instead of owning it and explaining it in the paper. The result is a source of confusion for people who run with the talking points of these reports instead of really digging deep and understanding the results.

I wrote this article when I realized how little of this information exists and how much confusion abounds as a result. While many throw around "95% confidence" or "p < .05," the understanding rarely goes deeper than that superficial level. Hopefully this article fills the hole in our collective knowledge in this area.

## Disclaimer
I have heavily simplified this topic in order to get the main points across. More rigorous analysis may be called for before confidence ratings are trusted; for example, you need to make sure the inputs were truly randomized. The paper "Statistical Analysis of ABX Results Using Signal Detection Theory" is a superb review of ABX testing in general and of this point in particular.

## References
"Audibility of a CD-Standard A/D/A Loop Inserted into High-Resolution Audio Playback," E. Brad Meyer and David R. Moran, Engineering Report, J. Audio Eng. Soc., Vol. 55, No. 9, 2007 September

"The Audibility of Typical Digital Audio Filters in a High-Fidelity Playback System," Convention Paper, presented at the 137th AES Convention, 2014

"Statistical Analysis of ABX Results Using Signal Detection Theory," Jon Boley and Michael Lester, Presented at the 127th Convention 2009 October 9–12

"High Resolution Audio: Does It Matter?," Amir Majidimehr, Widescreen Review Magazine, January 2015

#### Blumlein 88

##### Grand Contributor
Forum Donor
You make some good points in the article.

The test on the filtering is an interesting result. One would have to say that if you get 56% correct with enough trials, then yes, the effect is likely real. However, it would also be a small effect, smaller than one that might result in, say, 150 out of 160 trials being correctly chosen. Testing with only 10 trials, you would expect results to cluster around 5 or 6 correct.

So at what level is a difference significant enough to worry about? Some would say any real difference is worth worrying about, but is it? I am no specialist in statistics, but I feel you need around 30 trials to be pretty confident. Of course, doing 30 trials is tiring and difficult for an individual. When I personally do such a thing with Foobar, I am comfortable enough doing 20 trials. I think replication of a result on an individual basis is important as well. If you can hit the p < .05 level twice, there is little chance you were just lucky; you also reach a combined confidence of less than a 0.3% probability that the results were random.

Now this is sufficient to find some awfully small differences. Already I am wondering if the smallest effect you pick up in a pair of 20-trial tests is really important enough to matter in your usual listening scenario. Were this academic research, or even a company trying to create the utmost gear, I can see it. In the hubbub of day-to-day life and musical listening for enjoyment, I am not so sure. Your own feelings, tiredness or freshness, varying background levels of noise, and other factors all interfere more than these effects barely detectable in a pair of 20-trial tests. For many genuine though minor effects, turning up the volume 1 dB is a much, much bigger difference in what you can hear.

OP

#### amirm

Staff Member
CFO (Chief Fun Officer)
Personally I aim for 100% success in my blind testing. That way there is no doubt. Using non-expert listeners, though, makes such goals unreachable.

#### dallasjustice

##### Major Contributor
http://www.aes.org/e-lib/browse.cfm?elib=14444

In this study, the authors wanted to test whether listeners could reliably hear the difference between two capacitors with ABX listening tests. They got statistically insignificant results and blamed the ABX format as having stressed the listeners too much. However, they also did the study with simple AB tests and three capacitors. They claim statistically significant results using AB tests. What are the pitfalls to using AB subjective tests versus ABX tests?

Of course, a cap manufacturer sponsored the study.

#### Jinjuku

##### Major Contributor
Forum Donor
Does the sample size ever get to the point that the threshold for a 5% or less probability of chance comes within spitting distance of simply flipping a coin?

#### Jinjuku

##### Major Contributor
Forum Donor
Also I want to be clear in my understanding. This is 160 trials of individuals going through X number of AB/X? So let's say 160 persons sitting through 20 possible permutations?

#### Blumlein 88

##### Grand Contributor
Forum Donor
Also I want to be clear in my understanding. This is 160 trials of individuals going through X number of AB/X? So let's say 160 persons sitting through 20 possible permutations?

10,000 trials mean 51% is enough for a 5% probability that it is chance.

20 people doing 8 trials each would be 160 trials. 56% would meet the threshold for 95% significance.
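These figures can be checked with a short Python sketch (my own, not from the thread). It computes the same thing as Excel's BINOM.INV, but with exact integer arithmetic, because 0.5 to the power of 10,000 underflows ordinary floating point:

```python
from math import comb

def correct_needed(trials: int) -> int:
    """Smallest number of right answers whose cumulative binomial probability
    (p = 0.5) reaches 95%, using exact integers so huge trial counts work."""
    cumulative, term = 0, 1      # term tracks comb(trials, k), starting at k = 0
    total = 2 ** trials
    for k in range(trials + 1):
        cumulative += term
        if 100 * cumulative >= 95 * total:   # cumulative / total >= 0.95
            return k
        term = term * (trials - k) // (k + 1)

for n in (160, 10_000):
    k = correct_needed(n)
    print(f"{n} trials: {k} right answers ({k / n:.2%})")
# 160 trials: 90 right answers (56.25%)
# 10000 trials: 5082 right answers (50.82%)
```

So at 10,000 trials the threshold is about 50.8% correct, matching the 51% figure above.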

#### Jinjuku

##### Major Contributor
Forum Donor
Perfect. I just wanted to make sure my understanding was correct. I can see why some people who didn't think it through debated the math, but it makes 100% sense.

#### Vincent Kars

##### Addicted to Fun and Learning
Technical Expert
My feeling is ABX is a bit too severe.
Not only do you have to hear a difference, as in A/B, but you also have to label it correctly.

If I play you
A – Beethoven String quartet No.1
B – Dazed and Confused – Led Zeppelin
You can probably perfectly ABX it

If I play you
A – Beethoven String quartet No.1 – Alban Berg Quartet
B – Beethoven String quartet No.1 – Takács Quartet

I won’t be surprised you hear the differences but do you have sufficient clues to label X correctly?

#### Jinjuku

##### Major Contributor
Forum Donor
My feeling is ABX is a bit too severe.
Not only do you have to hear a difference, as in A/B, but you also have to label it correctly.

If I play you
A – Beethoven String quartet No.1
B – Dazed and Confused – Led Zeppelin
You can probably perfectly ABX it

If I play you
A – Beethoven String quartet No.1 – Alban Berg Quartet
B – Beethoven String quartet No.1 – Takács Quartet

I won’t be surprised you hear the differences but do you have sufficient clues to label X correctly?

I think that depends on how well you know the recordings and what you as an individual may be claiming.

Play Holst's "The Planets" where one is Telarc and the other is EMI/Elgar, and I think I could hit that one 10 out of 10.

OP

#### amirm

Staff Member
CFO (Chief Fun Officer)
http://www.aes.org/e-lib/browse.cfm?elib=14444

In this study, the authors wanted to test whether listeners could reliably hear the difference between two capacitors with ABX listening tests. They got statistically insignificant results and blamed the ABX format as having stressed the listeners too much. However, they also did the study with simple AB tests and three capacitors. They claim statistically significant results using AB tests. What are the pitfalls to using AB subjective tests versus ABX tests?

Of course, a cap manufacturer sponsored the study.
Let me do a Digest write-up on it. For now, their testing significantly raises my eyebrows.

#### Don Hills

##### Addicted to Fun and Learning
...
If I play you
A – Beethoven String quartet No.1 – Alban Berg Quartet
B – Beethoven String quartet No.1 – Takács Quartet

I won’t be surprised you hear the differences but do you have sufficient clues to label X correctly?

If you consistently hear differences between A and B, you have sufficient clues to say whether X is A or B. The task is to identify X as A or B, not to identify the name of the piece or the quartet.

#### fas42

##### Major Contributor
I decided to give ABX a go a couple of years ago, and found Foobar was pretty well it in a Windows PC situation. Didn't like what I was hearing, both A and B were relatively poor - what's going on? Did some investigating, turns out that the Foobar ABX is badly engineered in terms of how it operates - is very highly dependent on the particular PC behaviours - in my case, the sound quality was so degraded that it was quite pointless using this "tool". And I haven't come across any better options since ...

My opinion is that if the measurement people want to get some genuinely useful feedback, first of all get your tools working well!! A crappy multimeter is not going to serve anyone well; the tools _must_ work as accurately and as transparently as possible for them to be taken seriously ...

#### Blumlein 88

##### Grand Contributor
Forum Donor
I decided to give ABX a go a couple of years ago, and found Foobar was pretty well it in a Windows PC situation. Didn't like what I was hearing, both A and B were relatively poor - what's going on? Did some investigating, turns out that the Foobar ABX is badly engineered in terms of how it operates - is very highly dependent on the particular PC behaviours - in my case, the sound quality was so degraded that it was quite pointless using this "tool". And I haven't come across any better options since ...

My opinion is that if the measurement people want to get some genuinely useful feedback, first of all get your tools working well!! A crappy multimeter is not going to serve anyone well; the tools _must_ work as accurately and as transparently as possible for them to be taken seriously ...

Try it again. You do need an ASIO or WASAPI connection with your device. If you didn't have one of those, it is no wonder you were disappointed. The ABX part has been refined in the past year or so.

#### fas42

##### Major Contributor
I went through all that ASIO and WASAPI stuff - it wasn't possible on one machine, and didn't help on another. I wouldn't have objected so strongly if not for the fact that a simple Nero player did a far better job, straight off the mark, on normal playing of material.

If the ABX mechanism has been refined just lately it might be worth trying again - I'll check it out, thanks ...

#### Vincent Kars

##### Addicted to Fun and Learning
Technical Expert
If you consistently hear differences between A and B, you have sufficient clues to say whether X is A or B. The task is to identify X as A or B, not to identify the name of the piece or the quartet.

Maybe I'm a bit unclear (or simply wrong).
What I'm wondering is, if the differences are subtle, whether I really have sufficient clues to label A and B correctly.
I can imagine that if you play me X, I can reliably answer whether X2 differs from X1 but am still at a loss as to whether it is A or B.

I do think the task of answering "same or different" is easier than labeling X as A or B correctly.
"Same or different" I can do on the spot. Labeling X correctly, I think, requires substantial priming if the differences are subtle.

OP

#### amirm

Staff Member
CFO (Chief Fun Officer)
There is a quandary with ABX testing. And that is the fact that we want to use that type of testing when differences get small, i.e. beyond obvious. For example we don't do ABX tests of two different songs, or two different speakers. We "know" they are different and ABX test would just tell us that.

On the other hand, when differences get very small, the obvious answer is not in front of us. So we use ABX tests. The problem there is that humans are prone to second-guessing themselves. They may very well hear those small differences but think they are imagining them, and vote against it.

There are two partial solutions/answers to above:

1. Use training and trained listeners. Through training, testers can amplify in their minds the audible differences by ignoring everything else. One example is in professional video, where we turn off the color in the monitor and examine the black and white; problems that were hidden now become more visible. Likewise, a trained listener can home in on a specific artifact and ignore the other "pretty" stuff in the music.

In that regard, it is important that we don't dilute the votes of trained listeners with those of ordinary listeners. That is a sure way to get combined results of "no better than chance." If we are building products for the mass market, that would be fine, but not if we aim to build best-in-class for high-end customers.

2. Controlled tests show that ABX tests, due to the ability to instantly switch between stimuli, are able to bring out differences far smaller than slow, long-term AB tests. I can dig up the AES paper if there is interest, but trust me that the test was done. For me this is absolutely true. Many of you know of the ABX tests that I have passed. None of that would have been possible if I could not zoom into a small segment and keep repeating it.

What this means is that we have an imperfect tool. But like many imperfect tools, with correct use we can get far more reliable data than throwing it out and instead using sighted tests.

#### Fitzcaraldo215

##### Major Contributor
I agree that ABX is of very limited use in audio these days where we are quite often dealing with small differences. Our limited acoustic memory plays a significant role in that. And, as differences become smaller, test fatigue will become an increasing factor, helping to bias the results more toward pure guesses and results to no better than chance, even where differences exist. It gets frustrating when differences are small. This might partly explain why I have seen many more negative ABX test results than those statistically supporting that there is a difference. And, of course, those negative "no better than chance" or "not statistically significant" results of ABX never, ever conclusively prove there is no actual difference. Besides, as an audiophile, I am much less interested in differences than in preferences, and ABX is of no help with the latter.

Yes, preference testing is more subjective, but if the test samples cover enough test subjects and reveal statistically persistent similarities in preference among them, the results can be useful. The Harman speaker studies illustrate this, and they also allow for more than two DUTs in the test protocol: ABCD preference ranking, as Harman did. Statistically consistent preference also, of course, implies an inevitable difference, which can be assumed.

I have always found the concept of being forced to identify X as A or B to be an "unnatural act." It is not something I have done very much, versus simple AB preference, in comparing components myself in audition. The latter seems much more natural and intuitive. And a learned test strategy to deal with this not-so-common listening protocol in ABX would seem necessary for the test subjects. Double-blind AB preference testing does not have the rah-rah adherents that ABX does, however. But arguments by ABX devotees about how conceptually and intuitively "simple" that protocol is for the test subject do not convince me. I do not find it simple or intuitive at all in operation, and it might be influenced by factors not generally conceded by ABX fans.

Trying to use AB rather than ABX for just difference, not preference, testing would also require considerable care, it seems to me. There would, of course, have to be as many random cases where A equals B as where A does not equal B. Otherwise, just guessing "different" would usually be right, destroying the test's usefulness. But if testing A vs. B is to be done, why just measure "same/different"? Why not measure preference and look for statistical consistency of that preference to confirm there is both a difference and a preference? If there is no consistent preference, it might still mean they sound different but neither is preferred to the other. But how much do we really care about that?
