
DAC ABX Test Phase 1: Does a SOTA DAC sound the same as a budget DAC if proper controls are put in place? Spoiler: Probably yes. :)

pkane

Master Contributor
Forum Donor
Joined
Aug 18, 2017
Messages
5,632
Likes
10,205
Location
North-East
The p-value shown in your screen shot seems to be wrong. For 10 correct answers from 10 trials, it should be 0.001 or 0.1%, not 0.00169% as shown. I do see this ABX tool producing correct p-values for other tests, so I'm not sure how you got that particular result...

View attachment 175177


Indeed I have. First I completed the pretty obvious ABX demo and got the expected results.

Then I also tested the tool's behaviour while measuring the streamed audio quality with my own 1 kHz test tones; here's the result of that:
View attachment 175167
I.e., the behaviour of the ABX tool in all my test trials was consistent and exactly as expected; I found no such issues.

Perhaps I should note that I too cannot hear any difference in my E50/D03K ABX test (my HF hearing ends somewhere between 16-17kHz).
 

SIY

Grand Contributor
Technical Expert
Joined
Apr 6, 2018
Messages
10,383
Likes
24,749
Location
Alfred, NY
These amateurish forum debates are hopeless, going around in circles with the same never-ending arguments.
We’ll keep staring at the ground and drooling.
 
OP
dominikz

Addicted to Fun and Learning
Forum Donor
Joined
Oct 10, 2020
Messages
803
Likes
2,626
The p-value shown in your screen shot seems to be wrong. For 10 correct answers from 10 trials, it should be 0.001 or 0.1%, not 0.00169% as shown. I do see this ABX tool producing correct p-values for other tests, so I'm not sure how you got that particular result...

View attachment 175177
Thanks - my statistics knowledge is admittedly rusty, so please excuse me if I'm mistaken, but wouldn't the value you suggest only be valid for a test with two possible selections per trial (a standard binomial test with two outcomes, as in a typical ABX)? Note that the screenshot I posted was for a test with three possible selections per trial (three outcomes; an ABCX test, if you will).
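Just to sanity-check both numbers (a hypothetical snippet, not the tool's actual code): the chance of guessing every trial correctly is simply (1/k)^n for k equally likely choices per trial.

```typescript
// Probability of guessing all n trials correctly when each trial
// offers k equally likely choices: (1/k)^n.
function allCorrectPValue(n: number, k: number): number {
  return Math.pow(1 / k, n);
}

console.log(allCorrectPValue(10, 2)); // ~0.000977  -> ~0.1%     (two outcomes, standard ABX)
console.log(allCorrectPValue(10, 3)); // ~0.0000169 -> ~0.00169% (three outcomes, "ABCX")
```

So both the 0.1% figure and the 0.00169% in my screenshot are consistent, just for different numbers of choices per trial.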
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,494
Likes
1,971
Location
La Garriga, Barcelona
@jaakkopasanen, IMHO some more unit tests would provide more confidence in the tool's results, especially considering that you have implemented all the probability functions from scratch. You know, TDD ;).
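For illustration only, such a test could pin the from-scratch functions to hand-computed values; binomialPMF and its signature here are hypothetical, and the project's actual API may differ:

```typescript
// Hypothetical check of a from-scratch binomial PMF against
// values computed by hand.
function binomialPMF(n: number, k: number, p: number): number {
  let coeff = 1; // running binomial coefficient C(n, k)
  for (let i = 0; i < k; i++) coeff = (coeff * (n - i)) / (i + 1);
  return coeff * Math.pow(p, k) * Math.pow(1 - p, n - k);
}

function assertClose(actual: number, expected: number, eps = 1e-12): void {
  if (Math.abs(actual - expected) > eps) {
    throw new Error(`expected ${expected}, got ${actual}`);
  }
}

// 10/10 correct at p = 1/2: P(X = 10) = (1/2)^10
assertClose(binomialPMF(10, 10, 0.5), Math.pow(0.5, 10));
// 8/16 correct at p = 1/2: P(X = 8) = C(16,8)/2^16 = 12870/65536
assertClose(binomialPMF(16, 8, 0.5), 12870 / 65536);
```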
 

pkane

Master Contributor
Forum Donor
Joined
Aug 18, 2017
Messages
5,632
Likes
10,205
Location
North-East
Thanks - my statistics knowledge is admittedly rusty, so please excuse me if I'm mistaken, but wouldn't the value you suggest only be valid for a test with two possible selections per trial (a standard binomial test with two outcomes, as in a typical ABX)? Note that the screenshot I posted was for a test with three possible selections per trial (three outcomes; an ABCX test, if you will).

Ah, that's possible. It looked like a standard ABX test.
 
OP
dominikz

Addicted to Fun and Learning
Forum Donor
Joined
Oct 10, 2020
Messages
803
Likes
2,626
Ah, that's possible. It looked like a standard ABX test.
No worries! Here's how it calculates the p-value for 10 correct answers when only two files are compared (two outcomes, standard ABX):
[screenshot: the tool's result for 10/10 correct with two outcomes]

With two outcomes the p-value is indeed approx. 0.1%, as you expected.

A multinomial pmf is used in the code.
Indeed, from the tool webpage:
p-value is calculated automatically for each individual test with the polynomial probability mass function.
 

Pdxwayne

Major Contributor
Joined
Sep 15, 2020
Messages
3,219
Likes
1,172
Indeed I have. First I completed the pretty obvious ABX demo and got the expected results.

Then I also tested the tool's behaviour while measuring the streamed audio quality with my own 1 kHz test tones; here's the result of that:
View attachment 175167
I.e., the behaviour of the ABX tool in all my test trials was consistent and exactly as expected; I found no such issues.

Perhaps I should note that I too cannot hear any difference in my E50/D03K ABX test (my HF hearing ends somewhere between 16-17kHz).
Interesting...

With my phone, I could clearly hear a difference between A and B. But when using my laptop into my DAC and amp combo, the difference I heard for A was pretty much gone...
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,494
Likes
1,971
Location
La Garriga, Barcelona
Indeed, from the tool webpage:
p-value is calculated automatically for each individual test with the polynomial probability mass function.
The generalization of the binomial distribution to more than two categories is called the multinomial distribution, and this is what the function multinomialPMF implements in the project. I'm not aware of any author who calls it the polynomial distribution. I guess it's a typo.
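For reference, a minimal sketch of such a function (the project's multinomialPMF may be organized quite differently); grouping the trials as correct/incorrect, it reproduces the numbers discussed above:

```typescript
// Multinomial PMF: n! / (x1!·...·xm!) · p1^x1 · ... · pm^xm,
// computed in log space to avoid overflow. A sketch only.
function logFactorial(n: number): number {
  let s = 0;
  for (let i = 2; i <= n; i++) s += Math.log(i);
  return s;
}

function multinomialPMF(counts: number[], probs: number[]): number {
  const n = counts.reduce((a, b) => a + b, 0);
  let logP = logFactorial(n);
  for (let i = 0; i < counts.length; i++) {
    logP -= logFactorial(counts[i]);
    logP += counts[i] * Math.log(probs[i]);
  }
  return Math.exp(logP);
}

// Ten correct, zero wrong, with two vs. three choices per trial:
console.log(multinomialPMF([10, 0], [1 / 2, 1 / 2])); // ~0.000977  (0.1%)
console.log(multinomialPMF([10, 0], [1 / 3, 2 / 3])); // ~0.0000169 (0.00169%)
```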
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,494
Likes
1,971
Location
La Garriga, Barcelona
An SPL meter is not only expensive, but is a lousy tool for this purpose. Measure the electrical signal, it’s more accurate and far cheaper. Or use a software tool like Adobe or Goldwave.
In your experience, what precision is required to be considered "matched"?
everyone can use an AC multimeter and a sine test tone.
Sorry to ask this. Is it that simple, or am I missing something? Thanks.
 

SIY

Grand Contributor
Technical Expert
Joined
Apr 6, 2018
Messages
10,383
Likes
24,749
Location
Alfred, NY
In your experience, what precision is required to be considered "matched"?
0.1 dB. And yes, it's that simple. Given that these exist as files, it's even easier to make the adjustment in software.
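For instance, working from the RMS of each file, the required gain can be computed and applied in a few lines (a sketch over raw float sample buffers, not any particular editor's API):

```typescript
// RMS level of a buffer of float samples.
function rms(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Gain in dB that brings `target` to the same RMS level as `reference`.
function gainToMatchDb(reference: Float32Array, target: Float32Array): number {
  return 20 * Math.log10(rms(reference) / rms(target));
}

// Apply the gain; the RMS levels then agree to well within 0.1 dB.
function applyGainDb(samples: Float32Array, db: number): Float32Array {
  const g = Math.pow(10, db / 20);
  return samples.map((s) => s * g);
}
```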
 

GaryH

Major Contributor
Joined
May 12, 2021
Messages
1,348
Likes
1,804
If there is a need I can of course post the files directly, but could you please clarify why you feel comparison via abxtests.com tool is not 'proper'?
I think the various issues others have posted have made that obvious by now.
Perhaps the only part missing is the ability to loop just a specific section of the test track (rather than looping the whole thing).
This is one of the biggest of those issues. The other big issue is the very basic tenet of any scientifically valid test: control for potential confounding variables. There is no way you can test and know what every browser on every system running through every OS mixer is doing to the audio stream. Requiring everyone to do the ABX test in Foobar under WASAPI exclusive mode eliminates these potential confounding variables. There's a reason the Foobar ABX Comparator is the de facto standard on internet forums. It works. There's no need for online ABX tools; all they do is add more possible confounding variables.
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,494
Likes
1,971
Location
La Garriga, Barcelona
There is no way you can test and know what every browser on every system running through every OS mixer is doing to the audio stream.
This is true. However, if it does the same to all samples on the same computer for the same test, it isn't a problem to consider the test valid when someone obtains a p-value of 0.05 or lower. In that case the subject can differentiate the two samples, regardless of what the computer does. It's hard to imagine a case in which, due to these effects of different software platforms, the subject is able to score 16/16 by chance.

It's also true, of course, that if there are issues with reproduction, this could make the tool a poor one, as it would make the task unnecessarily difficult, but that's another story. Just speculating, I haven't tried it yet.

In summary: successful tests are valid, while failed tests can fail for any reason, including a failure in the tool, so they don't prove anything. If nobody can pass the test, then we can speculate that the tool isn't good, just as with any other tool, but this issue is common to any measuring device in any context.

This is how I see it. Please correct me if I'm wrong.
 
OP
dominikz

Addicted to Fun and Learning
Forum Donor
Joined
Oct 10, 2020
Messages
803
Likes
2,626
Looks like the tool is showing P(X = x) instead of P(X >= x). With 8 out of 16 it shows 0.196, instead of 0.598:
View attachment 175190
Thanks for pointing this out! I'll be sure to include the P(X >= x) p-value when I prepare my overview. Tagging @jaakkopasanen in case he wants to look into it for his tool.
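For anyone following along, the exact one-sided p-value is just the upper tail of the PMF; a self-contained sketch with hypothetical helper names, which reproduces the 0.598 for 8 of 16:

```typescript
// Binomial PMF via a running binomial coefficient.
function binomialPMF(n: number, k: number, p: number): number {
  let coeff = 1;
  for (let i = 0; i < k; i++) coeff = (coeff * (n - i)) / (i + 1);
  return coeff * Math.pow(p, k) * Math.pow(1 - p, n - k);
}

// P(X >= x): sum the PMF over the upper tail.
function pValueAtLeast(n: number, x: number, p: number): number {
  let total = 0;
  for (let k = x; k <= n; k++) total += binomialPMF(n, k, p);
  return total;
}

console.log(binomialPMF(16, 8, 0.5).toFixed(3));   // "0.196" -- P(X = 8) only
console.log(pValueAtLeast(16, 8, 0.5).toFixed(3)); // "0.598" -- the proper p-value
```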

I think the various issues others have posted have made that obvious by now.
Since the cat's out of the bag already :), let me start by providing links to the original files for offline use with the foobar2000 ABX comparator (as requested by @KSTR and @Pdxwayne as well):

1. Topping E50 sample file
2. FiiO Taishan D03K sample file

To be fair, there are currently really only 3 criticisms of the abxtests.com tool as far as I can tell:
1) The streamed sound quality of the tool may be compromised with certain end-user setups and systems, since these are outside the control of the test. This is of course a valid concern, but IMHO it doesn't necessarily make the test invalid. Note that the foobar2000 ABX comparator is not immune to a similar type of criticism either. More on this below.
2) The tool calculates the p-value based on P(X=x) instead of P(X>=x). This alone doesn't make the test results themselves invalid, as we can still calculate P(X>=x) from the same test data.
3) The tool doesn't let you seek or select a smaller subset of the track to loop. While I'd love for the tool to have this functionality, it is an added feature, and the lack of it doesn't imply invalid ABX methodology. It just makes it more difficult to identify differences.

To clarify, I have nothing against the foobar2000 ABX comparator; my motivation for using the abxtests.com tool instead was:
- The hope that it would make the test more accessible and easier to use
- To try something new and support what is IMHO a really good effort by a fellow forum member

This is one of the biggest of those issues. The other big issue is the very basic tenet of any scientifically valid test: control for potential confounding variables. There is no way you can test and know what every browser on every system running through every OS mixer is doing to the audio stream. Requiring everyone to do the ABX test in Foobar under WASAPI exclusive mode eliminates these potential confounding variables. There's a reason the Foobar ABX Comparator is the de facto standard on internet forums. It works. There's no need for online ABX tools; all they do is add more possible confounding variables.
We should however be honest and admit that with these kinds of remote, self-administered tests there is no way to control all variables, regardless of the application/tool.

Note that we could make a similar argument that some participants might not use good enough transducers, might not have correct gain-staging, might not apply room correction with loudspeakers, etc.
The more variables we can control the better, naturally, but we can never control everything with such tests.
This applies to the foobar2000 ABX comparator or any other similar tool as well.

Given that this thread is not meant to be a scientific article, I hope we can live with some level of uncertainty and still find some value in the data provided :) (though I agree it is important to be aware of the limitations as well, and to avoid drawing hasty conclusions).

This is true. However, if it does the same to all samples on the same computer for the same test, it isn't a problem to consider the test valid when someone obtains a p-value of 0.05 or lower. In that case the subject can differentiate the two samples, regardless of what the computer does. It's hard to imagine a case in which, due to these effects of different software platforms, the subject is able to score 16/16 by chance.

It's also true, of course, that if there are issues with reproduction, this could make the tool a poor one, as it would make the task unnecessarily difficult, but that's another story. Just speculating, I haven't tried it yet.

In summary: successful tests are valid, while failed tests can fail for any reason, including a failure in the tool, so they don't prove anything. If nobody can pass the test, then we can speculate that the tool isn't good, just as with any other tool, but this issue is common to any measuring device in any context.

This is how I see it. Please correct me if I'm wrong.
I also share a similar view :)
 

Pdxwayne

Major Contributor
Joined
Sep 15, 2020
Messages
3,219
Likes
1,172
Since the cat's out of the bag already :), let me start by providing links to the original files for offline use with the foobar2000 ABX comparator (as requested by @KSTR and @Pdxwayne as well):

1. Topping E50 sample file
2. FiiO Taishan D03K sample file
I have downloaded the songs and I can't do better than guessing using foobar2000. :)

Can you do another one with strong sub-bass (with lots of ~30 Hz or lower tones)? Like the ones in https://www.audiosciencereview.com/forum/index.php?threads/bass.18999/post-1017710, https://www.audiosciencereview.com/forum/index.php?threads/bass.18999/post-924704, or another strong bass track from the BASS! thread?

Thanks!
 

KSTR

Major Contributor
Joined
Sep 6, 2018
Messages
2,690
Likes
6,013
Location
Berlin, Germany
To be fair, there are currently really only 3 criticisms of the abxtests.com tool as far as I can tell:
1) The streamed sound quality of the tool may be compromised with certain end-user setups and systems, since these are outside the control of the test. This is of course a valid concern, but IMHO it doesn't necessarily make the test invalid. Note that the foobar2000 ABX comparator is not immune to a similar type of criticism either. More on this below.
2) The tool calculates the p-value based on P(X=x) instead of P(X>=x). This alone doesn't make the test results themselves invalid, as we can still calculate P(X>=x) from the same test data.
3) The tool doesn't let you seek or select a smaller subset of the track to loop. While I'd love for the tool to have this functionality, it is an added feature, and the lack of it doesn't imply invalid ABX methodology. It just makes it more difficult to identify differences.
Thanks for the files. I'll check whether I get bit-identical recordings in digital loopback.

I have to say I find the music choice unsatisfying for these kinds of tests. It's not a very nice recording to begin with: very bland contemporary pop production.

As for 1): Foobar's ABX plugin is bit-transparent. RME's unique bit-test function passes (at 24 bits, as foobar is float32 internally and cannot render that to 32-bit fixed point; not that it would matter here).

3) is the main downside for me. Roadblocks of this type quickly lead to fatigue and demotivation.

One point to mention:
4) Crossfade is normally forbidden for ABX, as it may easily introduce phasing, which can be a false cue even when the clocks are very close to each other... and if they are not, the timing difference will be an even greater false cue.
A silence gap, on the other hand, is very annoying. A crossfade through level-tracking noise would be the best option: it would provide better auditory stream continuity and resolve the phasing issue when the low clock-offset condition is met.
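To put a rough number on that phasing mechanism: near the midpoint of a crossfade the output is approximately the sum of the track and a copy of itself delayed by the clock offset $\tau$, i.e. a comb filter:

$$y(t) \approx x(t) + x(t - \tau) \quad\Rightarrow\quad |H(f)| = 2\,\lvert\cos(\pi f \tau)\rvert,$$

with nulls at $f = (2k+1)/(2\tau)$. For a subsample offset of, say, half a sample at 44.1 kHz ($\tau \approx 11.3\,\mu\text{s}$), the first null only lands near 44.1 kHz, above the audio band; with larger offsets the nulls move down into the audible range.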

I've checked my loopback-recorded files, and the clock drift and notably the clock offset are really surprisingly small (subsample region), so in this case the phasing is minimal, but noticeable when you know what to listen for and the crossfade happens to land at points in the files where the phasing is at its maximum.

Other than that, the measurements you posted contain a clear pointer to what to listen for to start with... I suggest disclosing it, as the goal of an ABX is maximum sensitivity, which is readily increased with prior knowledge of the obvious differences.
 

KSTR

Major Contributor
Joined
Sep 6, 2018
Messages
2,690
Likes
6,013
Location
Berlin, Germany
I'll check whether I get bit-identical recordings in digital loopback.
Confirmed.
I also note the professional trimming and level matching; thanks @dominikz for the almost flawless preparation and execution of the test. That's the level of thoroughness we need here.

Care to share the original snippet as well?
 