Statistics of ABX Testing

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
There seems to be a great deal of confusion about ABX tests here.

It's a test for differences
The null hypothesis is that there is no difference
A statistically significant result that rejects the null hypothesis is proof that a difference exists
Failure to reject the null hypothesis does not prove the null hypothesis

It's an interesting twist that I'm getting criticised for citing statistically significant positive ABX test results, which prove there is a difference.

And I'm also being asked if I ignore null results, when I've been clear that null results are of no meaning.

It seems to me a very confused attitude coming from supporters of ABX testing.
From Wiki
An ABX test is a method of comparing two choices of sensory stimuli to identify detectable differences between them. A subject is presented with two known samples (sample A, the first reference, and sample B, the second reference) followed by one unknown sample X that is randomly selected from either A or B. The subject is then required to identify X as either A or B. If X cannot be identified reliably with a low p-value in a predetermined number of trials, then the null hypothesis cannot be rejected and it cannot be proven that there is a perceptible difference between A and B.​
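For readers who want the arithmetic behind "a low p-value in a predetermined number of trials": under the null hypothesis the subject is guessing, so the number of correct identifications follows a binomial distribution with p = 0.5. A minimal sketch in Python (standard library only; the 12-of-16 score is purely illustrative):

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided p-value: probability of scoring at least `correct`
    hits in `trials` forced-choice trials if the listener is purely
    guessing (null hypothesis: hit rate = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

print(abx_p_value(12, 16))  # ~0.038, below the conventional 0.05 threshold
```

A score of 12/16 would therefore reject the null at the 5% level, while 11/16 (p ≈ 0.105) would not.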
 
OP
amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,663
Likes
240,993
Location
Seattle Area
John, that is the theory. The practice in industry/research is that if we set up a careful test and we get negative results in ABX, it is safe to assume there is no difference. We can't just throw away negative outcomes because "they don't prove anything."
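One way to make "a careful null carries information" concrete: a null result still puts an upper bound on how good the listener's true hit rate could plausibly be. A hedged sketch (Python, standard library, using the exact Clopper-Pearson upper limit; the trial counts are illustrative, not from any test discussed in this thread):

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_bound(k: int, n: int, alpha: float = 0.05) -> float:
    """Clopper-Pearson upper confidence limit on the true hit rate
    after observing k correct answers in n trials."""
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection on the (monotone) binomial CDF
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

print(upper_bound(8, 16))    # roughly 0.72: 16 chance-level trials rule out little
print(upper_bound(50, 100))  # roughly 0.58: 100 trials constrain the hit rate far more
```

On these assumptions, the weight a null result deserves scales with how many trials were run, which is exactly why the care of the setup matters.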
 

FrantzM

Major Contributor
Forum Donor
Joined
Mar 12, 2016
Messages
4,377
Likes
7,877
John

I agree with your previous post... The problem I have with your point re ABX testing is your insistence on how imperfect they are, so much so that one could infer that, for you, full-view, full-knowledge tests are superior. I have no doubt that they're far from perfect, but compared to full-view, full-knowledge tests they are more than sufficient to throw a wrench into most "observations" of massive, "day and night" differences and the other hyperbole High End Audio is full of.
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
John, that is the theory. The practice in industry/research is that if we set up a careful test and we get negative results in ABX, it is safe to assume there is no difference. We can't just throw away negative outcomes because "they don't prove anything."
I guess it depends on the question being posed in the test, Amir?
If asking the question "Is there a difference heard with this setup, these people, this test signal, on this day, etc.?", then a negative outcome can answer it.
If asking the more interesting question "Is there an audible difference between A & B?", then null results prove nothing.

If that's wrong, I would be interested in learning.
Edit: But yes, if it is a carefully set up industry/research test then the sample size will probably be significant & there is more coverage. My main criticism is of home-run ABX tests, which I find fall into "a bad test is worse than no test".
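The sample-size point can be quantified as statistical power: the probability that a test of a given length produces a significant result when the listener genuinely hears a difference. A minimal sketch (Python, standard library; the trial counts and the assumed 70% true hit rate are illustrative):

```python
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def abx_power(n: int, p_true: float, alpha: float = 0.05) -> float:
    """Probability that an n-trial ABX test reaches significance
    when the listener's true hit rate is p_true."""
    # smallest score that is significant at level alpha under pure guessing
    k_crit = next(k for k in range(n + 1) if binom_tail(k, n, 0.5) <= alpha)
    return binom_tail(k_crit, n, p_true)

print(abx_power(10, 0.7))  # roughly 0.15: a 10-trial home test usually misses a 70% listener
print(abx_power(40, 0.7))  # roughly 0.8: four times the trials, far fewer false nulls
```

On these assumptions a short home-run test will return a null most of the time even when a real, modest difference exists, which is the statistical content of "a bad test is worse than no test".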
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
John

I agree with your previous post... The problem I have with your point re ABX testing is your insistence on how imperfect they are, so much so that one could infer that, for you, full-view, full-knowledge tests are superior. I have no doubt that they're far from perfect, but compared to full-view, full-knowledge tests they are more than sufficient to throw a wrench into most "observations" of massive, "day and night" differences and the other hyperbole High End Audio is full of.
My position is that the ABX test has no quality checks within it & therefore requires proctoring by someone who knows how to do perceptual testing. So home-run ABX tests are always of questionable quality to me, & I've no real way of knowing what that quality is. And even when, on the odd occasion, someone registers a positive ABX test, it doesn't tell me which device is better.

As a result I favour anecdotal reports of sighted listening which give the equipment used, the volume-matched tracks used, & a description of the better/worse differences heard. I can use this information to check the same tracks on my own system for the same factors perceived.

Do I use personal blind tests? Yes, the odd time I'm not sure if there are differences, but I really don't get into statistically significant ABX testing - it's informal - I can either hear it consistently over a small number of trials or not. I'm not really interested in too-close-to-call differences.

I also believe that certain types of differences are not amenable to ABX testing or are very difficult to differentiate in ABX.
 
OP
amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,663
Likes
240,993
Location
Seattle Area
If asking the more interesting question "Is there an audible difference between A & B?", then null results prove nothing.

If that's wrong, I would be interested in learning.
Even in the case of finding someone guilty of murder in the US, we don't use "proof" as the standard. Instead, it is "beyond a reasonable doubt". If objective analysis says something shouldn't make a difference, and listening tests demonstrate the same, then we have shown that spending money in that direction is not merited. In that regard a negative outcome from a well-done test definitely provides data and bolsters such cases.
 

AJ Soundfield

Major Contributor
Joined
Mar 17, 2016
Messages
1,001
Likes
68
Location
Tampa FL
As a result I favour anecdotal reports of sighted listening which give the equipment used, the volume-matched tracks used, & a description of the better/worse differences heard. I can use this information to check the same tracks on my own system for the same factors perceived.
So you favor expectation bias, confirmation bias, among other things that make zero sense. Volume matched?? How on earth do you do that?
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
Even in the case of finding someone guilty of murder in the US, we don't use "proof" as the standard. Instead, it is "beyond a reasonable doubt". If objective analysis says something shouldn't make a difference, and listening tests demonstrate the same, then we have shown that spending money in that direction is not merited. In that regard a negative outcome from a well-done test definitely provides data and bolsters such cases.
I don't want to be argumentative, but we tend not to have industry/research-quality blind tests for the equipment (or audio tracks) we are considering purchasing - at least, I don't.
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
So you favor expectation bias, confirmation bias, among other things that make zero sense. Volume matched?? How on earth do you do that?
I didn't say I wanted non-volume-matched listening (I started a thread on the JND of volume differences with music signals), but I will deal with the issue that I might be influenced by biases - as I said, if I think it's close (or I suspect I might be biased, or I just want to check randomly) I will do an informal blind test. But again, I'm interested in bigger differences than the close calls.

I'm sure I make mistakes, everybody does, but I feel this is a more interesting & progressive way to improve my system than ABX blind testing.
 
OP
amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,663
Likes
240,993
Location
Seattle Area
I don't want to be argumentative, but we tend not to have industry/research-quality blind tests for the equipment (or audio tracks) we are considering purchasing - at least, I don't.
We don't. So we rely on tests that qualify categories of products. Our job is to indicate how unlikely it is for some product to make a difference/improvement. How far we go and how much one relies on that is up to the person. Let's get the data on the table, as there are a lot more blind tests than we even know about.
 

AJ Soundfield

Major Contributor
Joined
Mar 17, 2016
Messages
1,001
Likes
68
Location
Tampa FL
I didn't say I wanted non-volume-matched listening (I started a thread on the JND of volume differences with music signals)
You said, quote:
I favour anecdotal reports of sighted listening which give the equipment used, the volume-matched tracks used
What does that mean? You are attempting to recreate the volume of an anecdote you read?
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
We don't. So we rely on tests that qualify categories of products. Our job is to indicate how unlikely it is for some product to make a difference/improvement. How far we go and how much one relies on that is up to the person. Let's get the data on the table, as there are a lot more blind tests than we even know about.
I'm not sure I understand this, Amir - are you saying that a whole category of devices is qualified by a blind test or a couple of blind tests? For instance, amplifiers (just to stay away from DACs for obvious reasons)?
 

AJ Soundfield

Major Contributor
Joined
Mar 17, 2016
Messages
1,001
Likes
68
Location
Tampa FL
The equipment & audio tracks used along with the volume matching done between devices
There is no volume matching with anecdotal stories. None. Zero.
Are you talking about amateur "blind" comparison tests, or typical audiophile sighted equipment reviews, which are done one device at a time and possibly days to years apart?
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
There is no volume matching with anecdotal stories. None. Zero.
OK, I will judge an anecdotal report on how many of my criteria it meets (other factors - like whether they are fanboys or fanatics, what they are comparing to, etc. - are all considered). I will also have looked into the technology aspect. It's not a hard & fast judgement, but things like fanaticism/fanboy behaviour almost completely rule out the report - volume matching, not so much. When I then personally do the comparison of what I've short-listed from anecdotal reports, I do volume match as best I can when I need to.

Are you talking about amateur "blind" comparison tests, or typical audiophile sighted equipment reviews, which are done one device at a time and possibly days to years apart?
What are you asking here? I don't understand.
 

AJ Soundfield

Major Contributor
Joined
Mar 17, 2016
Messages
1,001
Likes
68
Location
Tampa FL
OK, I will judge an anecdotal report on how many of my criteria it meets (other factors - like whether they are fanboys or fanatics, what they are comparing to, etc. - are all considered). I will also have looked into the technology aspect. It's not a hard & fast judgement, but things like fanaticism/fanboy behaviour almost completely rule out the report - volume matching, not so much. When I then personally do the comparison of what I've short-listed from anecdotal reports, I do volume match as best I can when I need to.
Ok, so unlike the anecdotal report you read somewhere, you perform some sort of volume approximation ("matching") when you do your expectation + confirmation bias sighted listening.
So you aren't trying to recreate the conditions of the anecdote, which would have zero volume matching. Ok.
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,766
Likes
37,625
OK, I will judge an anecdotal report on how many of my criteria it meets (other factors - like whether they are fanboys or fanatics, what they are comparing to, etc. - are all considered). I will also have looked into the technology aspect. It's not a hard & fast judgement, but things like fanaticism/fanboy behaviour almost completely rule out the report - volume matching, not so much. When I then personally do the comparison of what I've short-listed from anecdotal reports, I do volume match as best I can when I need to.

What are you asking here? I don't understand.

Is your volume matching done by ear?
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
Is your volume matching done by ear?
I'm not anal about it - I've done it with a 1 kHz tone & a multimeter on the amp outputs, I've done it with an SPL meter, and I've done it by ear - I never really noticed any preference for the louder device - that's one of the reasons I started the JND volume differences thread.

If I'm testing purely digital devices (as I often am) I don't worry about volume.
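For reference, the arithmetic behind the multimeter method: with the same 1 kHz tone into the same load, two RMS voltage readings at the amplifier outputs convert to a level difference of 20·log10(Va/Vb) dB. A small sketch (the voltage readings are hypothetical):

```python
from math import log10

def level_difference_db(v_a: float, v_b: float) -> float:
    """Level difference in dB between two devices, from RMS voltages
    measured at the amp outputs while playing the same 1 kHz tone."""
    return 20 * log10(v_a / v_b)

print(level_difference_db(2.00, 1.95))  # ~0.22 dB
```

A mismatch that small is commonly held to fall below the level-difference JND for program material, which is the sort of question John's JND thread is about.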
 

fas42

Major Contributor
Joined
Mar 21, 2016
Messages
2,818
Likes
191
Location
Australia
I'm not anal about it - I've done it with a 1 kHz tone & a multimeter on the amp outputs, I've done it with an SPL meter, and I've done it by ear - I never really noticed any preference for the louder device - that's one of the reasons I started the JND volume differences thread.

If I'm testing purely digital devices (as I often am) I don't worry about volume.
I never worry about volume - because it makes zero difference when listening for defects: a mechanic doesn't move closer and further away from a car that's making a funny noise to "adjust the volume" - it's a yes/no decision as to whether there's a problem.

My gripe about so much of this testing is the use of the word "preference". To me, the sound is either 'right' or 'wrong'; it's very clear cut - dud system A, dud system B, two unpleasantnesses - I would "prefer" to switch both off and go outside ...
 