Statistics of ABX Testing

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
There seems to be a great deal of confusion about ABX tests here.

It's a test for differences
The null hypothesis is that there is no difference
A statistically significant result that rejects the null hypothesis is proof that a difference exists
Failure to reject the null hypothesis does not prove the null hypothesis

It's an interesting twist that I'm getting criticised for citing statistically significant positive ABX test results, which prove there is a difference.

And I'm also being asked if I ignore null results, when I've been clear that null results are of no meaning.

It seems to me a very confused attitude coming from supporters of ABX testing.
From Wiki
An ABX test is a method of comparing two choices of sensory stimuli to identify detectable differences between them. A subject is presented with two known samples (sample A, the first reference, and sample B, the second reference) followed by one unknown sample X that is randomly selected from either A or B. The subject is then required to identify X as either A or B. If X cannot be identified reliably with a low p-value in a predetermined number of trials, then the null hypothesis cannot be rejected and it cannot be proven that there is a perceptible difference between A and B.​
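For readers who want the arithmetic behind "a low p-value in a predetermined number of trials": under the null hypothesis the subject is guessing, so the number of correct identifications follows a binomial distribution with p = 0.5. A minimal sketch in Python (standard library only; the 12-of-16 score is purely illustrative):

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided p-value: probability of scoring at least `correct`
    hits in `trials` forced-choice trials if the listener is purely
    guessing (null hypothesis: hit rate = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

print(abx_p_value(12, 16))  # ~0.038, below the conventional 0.05 threshold
```

A score of 12/16 would therefore reject the null at the 5% level, while 11/16 (p ≈ 0.105) would not.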
 
OP
amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,663
Likes
240,993
Location
Seattle Area
John, that is the theory. The practice in industry/research is that if we set up a careful test and we get negative results in ABX, it is safe to assume there is no difference. We can't just throw away negative outcomes because "they don't prove anything."
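One way to make "a careful null carries information" concrete: a null result still puts an upper bound on how good the listener's true hit rate could plausibly be. A hedged sketch (Python, standard library, using the exact Clopper-Pearson upper limit; the trial counts are illustrative, not from any test discussed in this thread):

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_bound(k: int, n: int, alpha: float = 0.05) -> float:
    """Clopper-Pearson upper confidence limit on the true hit rate
    after observing k correct answers in n trials."""
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection on the (monotone) binomial CDF
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

print(upper_bound(8, 16))    # roughly 0.72: 16 chance-level trials rule out little
print(upper_bound(50, 100))  # roughly 0.58: 100 trials constrain the hit rate far more
```

On these assumptions, the weight a null result deserves scales with how many trials were run, which is exactly why the care of the setup matters.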
 

FrantzM

Major Contributor
Forum Donor
Joined
Mar 12, 2016
Messages
4,377
Likes
7,877
John

I agree with your previous post... The problem I have with your point re ABX testing is your insistence on how imperfect they are, so much so that one could infer that, for you, full-view, full-knowledge tests are superior. I have no doubt that they're far from perfect, but compared to full-view, full-knowledge tests they are more than sufficient to throw a wrench into most "observations" of massive, "day and night" differences and the other hyperbole High End Audio is full of.
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
John, that is the theory. The practice in industry/research is that if we set up a careful test and we get negative results in ABX, it is safe to assume there is no difference. We can't just throw away negative outcomes because "they don't prove anything."
I guess it depends on the question being posed in the test, Amir?
If asking the question "Is there a difference heard with this setup, these people, this test signal, on this day, etc.?", then a negative outcome can answer it.
If asking the more interesting question "Is there an audible difference between A & B?", then null results prove nothing.

If that's wrong, I would be interested in learning.
Edit: But yes, if it is a carefully set up industry/research test then the sample size will probably be significant & there is more coverage. My main criticism is of home-run ABX tests, which I find fall into "a bad test is worse than no test".
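The sample-size point can be quantified as statistical power: the probability that a test of a given length produces a significant result when the listener genuinely hears a difference. A minimal sketch (Python, standard library; the trial counts and the assumed 70% true hit rate are illustrative):

```python
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def abx_power(n: int, p_true: float, alpha: float = 0.05) -> float:
    """Probability that an n-trial ABX test reaches significance
    when the listener's true hit rate is p_true."""
    # smallest score that is significant at level alpha under pure guessing
    k_crit = next(k for k in range(n + 1) if binom_tail(k, n, 0.5) <= alpha)
    return binom_tail(k_crit, n, p_true)

print(abx_power(10, 0.7))  # roughly 0.15: a 10-trial home test usually misses a 70% listener
print(abx_power(40, 0.7))  # roughly 0.8: four times the trials, far fewer false nulls
```

On these assumptions a short home-run test will return a null most of the time even when a real, modest difference exists, which is the statistical content of "a bad test is worse than no test".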
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
John

I agree with your previous post... The problem I have with your point re ABX testing is your insistence on how imperfect they are, so much so that one could infer that, for you, full-view, full-knowledge tests are superior. I have no doubt that they're far from perfect, but compared to full-view, full-knowledge tests they are more than sufficient to throw a wrench into most "observations" of massive, "day and night" differences and the other hyperbole High End Audio is full of.
My position is that the ABX test has no quality checks within it & therefore requires proctoring by someone who knows how to do perceptual testing. So home-run ABX tests are always of questionable quality to me, & I've no real way of knowing what that quality is. And even when, on the odd occasion, someone registers a positive ABX test, it doesn't tell me which device is better.

As a result I favour anecdotal reports of sighted listening which give the equipment used, the volume-matched tracks used, & a description of the better/worse differences heard. I can use this information to check the same tracks on my own system for the same factors perceived.

Do I use personal blind tests? Yes, the odd time I'm not sure if there are differences, but I really don't get into statistically significant ABX testing - it's informal - I can either hear it consistently over a small number of trials or not. I'm not really interested in too-close-to-call differences.

I also believe that certain types of differences are not amenable to ABX testing or are very difficult to differentiate in ABX.
 
OP
amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,663
Likes
240,993
Location
Seattle Area
If asking the more interesting question "Is there an audible difference between A & B?", then null results prove nothing.

If that's wrong, I would be interested in learning.
Even in the case of finding someone guilty of murder in the US, we don't use "proof" as the standard. Instead, it is "beyond a reasonable doubt". If objective analysis says something shouldn't make a difference, and listening tests demonstrate the same, then we have shown that spending money in that direction is not merited. In that regard a negative outcome from a well-done test definitely provides data and bolsters such cases.
 

AJ Soundfield

Major Contributor
Joined
Mar 17, 2016
Messages
1,001
Likes
68
Location
Tampa FL
As a result I favour anecdotal reports of sighted listening which give the equipment used, the volume-matched tracks used, & a description of the better/worse differences heard. I can use this information to check the same tracks on my own system for the same factors perceived.
So you favor expectation bias, confirmation bias, among other things that make zero sense. Volume matched?? How on earth do you do that?
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
Even in the case of finding someone guilty of murder in the US, we don't use "proof" as the standard. Instead, it is "beyond a reasonable doubt". If objective analysis says something shouldn't make a difference, and listening tests demonstrate the same, then we have shown that spending money in that direction is not merited. In that regard a negative outcome from a well-done test definitely provides data and bolsters such cases.
I don't want to be argumentative, but we tend not to have industry/research-quality blind tests for the equipment (or audio tracks) we are considering purchasing - at least, I don't.
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
So you favor expectation bias, confirmation bias, among other things that make zero sense. Volume matched?? How on earth do you do that?
I didn't say I wanted non-volume-matched listening (I started a thread on the JND of volume differences with music signals), but I will deal with the issue that I might be influenced by biases - as I said, if I think it's close (or I suspect I might be biased, or I just want to check randomly) I will do an informal blind test. But again, I'm interested in bigger differences than the close calls.

I'm sure I make mistakes, everybody does, but I feel this is a more interesting & progressive way to improve my system than ABX blind testing.
 
OP
amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,663
Likes
240,993
Location
Seattle Area
I don't want to be argumentative, but we tend not to have industry/research-quality blind tests for the equipment (or audio tracks) we are considering purchasing - at least, I don't.
We don't. So we rely on tests that qualify categories of products. Our job is to indicate how unlikely it is for some product to make a difference/improvement. How far we go and how much one relies on that is up to the person. Let's get the data on the table, as there are a lot more blind tests than we even know about.
 

AJ Soundfield

Major Contributor
Joined
Mar 17, 2016
Messages
1,001
Likes
68
Location
Tampa FL
I didn't say I wanted non-volume-matched listening (I started a thread on the JND of volume differences with music signals)
You said, quote:
I favour anecdotal reports of sighted listening which give the equipment used, the volume-matched tracks used
What does that mean? You are attempting to recreate the volume of an anecdote you read?
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
We don't. So we rely on tests that qualify categories of products. Our job is to indicate how unlikely it is for some product to make a difference/improvement. How far we go and how much one relies on that is up to the person. Let's get the data on the table, as there are a lot more blind tests than we even know about.
I'm not sure I understand this, Amir - are you saying that a whole category of devices is qualified by a blind test or a couple of blind tests? For instance, amplifiers (just to stay away from DACs for obvious reasons)?
 

AJ Soundfield

Major Contributor
Joined
Mar 17, 2016
Messages
1,001
Likes
68
Location
Tampa FL
The equipment & audio tracks used along with the volume matching done between devices
There is no volume matching with anecdotal stories. None. Zero.
Are you talking about amateur "blind" comparison tests, or typical audiophile sighted equipment reviews, which are done one device at a time and possibly days to years apart?
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
There is no volume matching with anecdotal stories. None. Zero.
OK, I will judge an anecdotal report on how many of my criteria it meets (other factors - like whether they are fanboys or fanatics, what they are comparing to, etc. - are all considered). I will also have looked into the technology aspect. It's not a hard & fast judgement, but things like fanaticism/fanboy behaviour almost completely rule out the report - volume matching, not so much. When I then personally do the comparison of what I've short-listed from anecdotal reports, I do volume match as best I can when I need to.

Are you talking about amateur "blind" comparison tests, or typical audiophile sighted equipment reviews, which are done one device at a time and possibly days to years apart?
What are you asking here? I don't understand.
 

AJ Soundfield

Major Contributor
Joined
Mar 17, 2016
Messages
1,001
Likes
68
Location
Tampa FL
OK, I will judge an anecdotal report on how many of my criteria it meets (other factors - like whether they are fanboys or fanatics, what they are comparing to, etc. - are all considered). I will also have looked into the technology aspect. It's not a hard & fast judgement, but things like fanaticism/fanboy behaviour almost completely rule out the report - volume matching, not so much. When I then personally do the comparison of what I've short-listed from anecdotal reports, I do volume match as best I can when I need to.
Ok, so unlike the anecdotal report you read somewhere, you perform some sort of volume approximation ("matching") when you do your expectation + confirmation bias sighted listening.
So you aren't trying to recreate the conditions of the anecdote, which would have zero volume matching. Ok.
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,766
Likes
37,625
OK, I will judge an anecdotal report on how many of my criteria it meets (other factors - like whether they are fanboys or fanatics, what they are comparing to, etc. - are all considered). I will also have looked into the technology aspect. It's not a hard & fast judgement, but things like fanaticism/fanboy behaviour almost completely rule out the report - volume matching, not so much. When I then personally do the comparison of what I've short-listed from anecdotal reports, I do volume match as best I can when I need to.

What are you asking here? I don't understand.

Is your volume matching done by ear?
 

John Kenny

Addicted to Fun and Learning
Joined
Mar 25, 2016
Messages
568
Likes
18
Is your volume matching done by ear?
I'm not anal about it - I've done it with a 1 kHz tone & a multimeter on the amp outputs, I've done it with an SPL meter, and I've done it by ear - I never really noticed any preference for the louder device - that's one of the reasons I started the JND volume differences thread.

If I'm testing purely digital devices (as I often am) I don't worry about volume.
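For reference, the arithmetic behind the multimeter method: with the same 1 kHz tone into the same load, two RMS voltage readings at the amplifier outputs convert to a level difference of 20·log10(Va/Vb) dB. A small sketch (the voltage readings are hypothetical):

```python
from math import log10

def level_difference_db(v_a: float, v_b: float) -> float:
    """Level difference in dB between two devices, from RMS voltages
    measured at the amp outputs while playing the same 1 kHz tone."""
    return 20 * log10(v_a / v_b)

print(level_difference_db(2.00, 1.95))  # ~0.22 dB
```

A mismatch that small is commonly held to fall below the level-difference JND for program material, which is the sort of question John's JND thread is about.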
 

fas42

Major Contributor
Joined
Mar 21, 2016
Messages
2,818
Likes
191
Location
Australia
I'm not anal about it - I've done it with a 1 kHz tone & a multimeter on the amp outputs, I've done it with an SPL meter, and I've done it by ear - I never really noticed any preference for the louder device - that's one of the reasons I started the JND volume differences thread.

If I'm testing purely digital devices (as I often am) I don't worry about volume.
I never worry about volume - because it makes zero difference when listening for defects: a mechanic doesn't move closer and further away from a car that's making a funny noise to "adjust the volume" - it's a yes/no decision as to whether there's a problem.

My gripe about so much of this testing is the use of the word "preference". To me, the sound is either 'right' or 'wrong'; it's very clear cut - dud system A, dud system B, two unpleasantnesses - I would "prefer" to switch both off and go outside ...
 