
DAC ABX Test Phase 1: Does a SOTA DAC sound the same as a budget DAC if proper controls are put in place? Spoiler: Probably yes. :)

[Image: distribution of scores across the 350 test attempts]

As we see, out of the total of 350 attempts, 19 beat the lax <5% p-value criterion; of those 19, 10 were borderline for the stricter <1% criterion, and only 4 were well below it, scoring 15 or all 16 correct out of 16 trials.
As I've commented before, statistically there are two populations here, as the 4 cases on the right are very unlikely to be generated by a Gaussian with the sample mean and variance. A simple t-test would confirm that, but it's quite obvious. Of course it can happen by chance, but that chance is really low. Put another way: the probability that the observed distribution was generated by a Gaussian with the sample mean and variance is very low, so it's more likely there are two populations. Intuitively, Gaussian tails decay very quickly and very deeply; they don't revive, so to speak.
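A minimal sketch of that back-of-envelope check (Python with scipy; the exact chance model is Binomial(n=16, p=0.5) per attempt, of which the Gaussian is just the approximation):

```python
from scipy.stats import binom

n_trials, p_guess = 16, 0.5
n_attempts = 350

# Probability a single guessing attempt scores 15 or 16 out of 16
p_high = binom.sf(14, n_trials, p_guess)        # ~0.000259 (17/65536)

# Expected number of such attempts among 350, and the chance of seeing 4+
expected = n_attempts * p_high                  # ~0.09
p_four_plus = binom.sf(3, n_attempts, p_high)   # ~3e-6

print(f"P(score >= 15) per attempt: {p_high:.6f}")
print(f"Expected count in {n_attempts} attempts: {expected:.3f}")
print(f"P(4 or more such attempts by chance): {p_four_plus:.1e}")
```

With an expected count of roughly 0.09, seeing four attempts at 15+ correct is wildly improbable under pure guessing - hence the two-populations reading.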
Note that here I'm saying 'attempts' instead of 'participants' - this is because a few participants reported they took the test more than once.
Do you know if the 4 attempts on the right were made by people with other attempts? Were they made by 4 different subjects? I ask because the person is a meaningful variable here: one's knowledge, training and capacity make all the difference, so we can't lose track of it.

From here, perhaps the next step would be to confirm that these people are indeed capable of consistently telling the two DACs apart, and then simply ask them how they do it, so we can all learn. That knowledge could then be taught to a small random sample of this forum's readership, who would take the test again; if they then prove able to tell the DACs apart, you would have very nice proof that, with training, anyone can hear very subtle differences. Or am I just building castles in the air? :)
 
@xaviescacs Thanks a lot for (another) thoughtful response!

Do you know if the 4 attempts on the right were made by people with other attempts? Were they made by 4 different subjects? I ask because the person is a meaningful variable here: one's knowledge, training and capacity make all the difference, so we can't lose track of it.
I cannot reliably say, sadly - there are clues in the metadata sometimes that may indicate the same person took the test multiple times, but really I cannot be certain. There is no way for me to identify individual test participants - it is an anonymous test.

Also, please note that it is possible to 'cheat' in this test, as the FR differences between sound clips in the stream can be measured - see post #76 where I show the difference between the files, and then posts #140 and #141 where another user suggested this. Another example of how the test can be made easier can be found in post #98.
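For illustration, here is a minimal sketch of that kind of 'cheat' (Python; the file names are hypothetical placeholders for two clips captured from the stream, and the soundfile/scipy libraries are assumed available):

```python
import numpy as np
import soundfile as sf
from scipy.signal import welch

# Placeholder file names - stand-ins for two clips captured from the test stream
a, fs = sf.read("clip_a.wav")
b, _ = sf.read("clip_b.wav")

# Mix to mono if the clips are stereo
if a.ndim > 1:
    a = a.mean(axis=1)
if b.ndim > 1:
    b = b.mean(axis=1)

# Averaged power spectral density of each clip
f, psd_a = welch(a, fs, nperseg=8192)
_, psd_b = welch(b, fs, nperseg=8192)

# A frequency-response difference shows up as a stable level offset vs frequency
diff_db = 10 * np.log10(psd_a / psd_b)
for lo, hi in ((1000, 2000), (5000, 10000), (10000, 20000)):
    band = (f >= lo) & (f < hi)
    print(f"{lo}-{hi} Hz: mean level difference {diff_db[band].mean():+.2f} dB")
```

A consistent offset in the top band would identify which clip came from which DAC without any listening at all.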

IMHO there is unfortunately not enough control in this test to assume that an individual participant who scored highly actually heard a difference, and it is also quite possible that some of those who scored under the threshold could score better under controlled circumstances and/or with training.

TBH my intention here was just to provide a simple-to-use demonstration that differences between very differently measuring DACs can be much smaller than anticipated (looking at the data and/or price) - and I hoped this would be especially interesting to those who otherwise hadn't had a chance to participate in a level-matched, double-blind ABX test.

From here, perhaps the next step would be to confirm that these people are indeed capable of consistently telling the two DACs apart, and then simply ask them how they do it, so we can all learn. That knowledge could then be taught to a small random sample of this forum's readership, who would take the test again; if they then prove able to tell the DACs apart, you would have very nice proof that, with training, anyone can hear very subtle differences. Or am I just building castles in the air? :)
To be honest, I was actually originally expecting more participants would be able to tell these DACs apart because there are frequency response differences between them. I'm not sure if that is what most would call 'subtle differences', as many good DACs will have better-matched frequency responses than these two - usually the differences would be in noise level and distortion.

If we pushed this line of investigation forward I personally believe we'd just find that the >10kHz frequency response deviation of the FiiO DAC is meaningful to those with preserved high-frequency hearing who know what to listen for. Removing the FR difference (e.g. with PEQ) would likely make the test even more difficult - and perhaps impossible.

In summary, I'm not sure that pinning down what lets someone identify these two DACs reliably would let us draw conclusions about hearing subtle differences in general.

What I'd personally be interested in is whether we can do better in determining the maximum distortion levels that are still inaudible. IMHO then we would have a much clearer view on when mid-performing audio electronics become 'transparent' :)
However I'm not sure if such tests will ever materialize - given the widespread availability of very high performing audio electronics there seems to be less of a practical need to pinpoint the thresholds.
 
Got 12 of 16 on the first try, but for the first 5 or so I was still figuring out what to listen for.
Curious if I can do better on a second try - later, perhaps.

Did it again and got 11.
So I guess I am not fully guessing, but the difference is too small to be sure?
I guess I could do better with a high-pass filter, but I don't want to mess up the results.

I focused on the "s" of "myself" - hearing the SINAD difference seems impossible.
 
Your post is thoughtful! :)
Also, please note that it is possible to 'cheat' in this test, as the FR differences between sound clips in the stream can be measured - see post #76 where I show the difference between the files, and then posts #140 and #141 where another user suggested this. Another example of how the test can be made easier can be found in post #98.

IMHO there is unfortunately not enough control in this test to assume that an individual participant who scored highly actually heard a difference, and it is also quite possible that some of those who scored under the threshold could score better under controlled circumstances and/or with training.
I see, thanks for all the references. I agree with you then: the controls aren't solid enough to conclude anything, and more experiments should be performed to confirm whether someone is really capable of telling the two DACs apart. My suggestion was a bit naive, actually.
TBH my intention here was just to provide a simple-to-use demonstration that differences between very differently measuring DACs can be much smaller than anticipated (looking at the data and/or price) - and I hoped this would be especially interesting to those who otherwise hadn't had a chance to participate in a level-matched, double-blind ABX test.
IMO you accomplished that, and this post is very useful as an example to point people to whenever someone says they can hear a difference between DACs, etc.
 
I think the shape of the results paints a clear picture: those 4 results on the right are clear outliers, which suggests some kind of cheat was used.
 
Did it again and got 11.
So I guess I am not fully guessing, but the difference is too small to be sure?
If you try 10 times and get 11 or 12 each time, it's more or less the same as scoring 15 once (I haven't done the numbers, I just want to make my point). It's very unlikely that you can score 11 or 12 five or ten times in a row acting randomly. So it looks like you have your foot on its throat.
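Here is a quick sketch of that intuition (Python with scipy, treating each 16-trial run as independent pure guessing):

```python
from scipy.stats import binom

p11 = binom.sf(10, 16, 0.5)   # P(score >= 11 | guessing) ~ 10.5%
p15 = binom.sf(14, 16, 0.5)   # P(score >= 15 | guessing) ~ 0.026%

print(f"one run at 15+/16       : {p15:.5f}")     # ~0.00026
print(f"five runs at 11+/16     : {p11**5:.1e}")  # ~1.3e-5
```

Five runs at 11+ correct are in fact less likely under pure guessing than a single run at 15+ correct.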
 
I think the shape of the results paints a clear picture: those 4 results on the right are clear outliers, which suggests some kind of cheat was used.
That's a possible explanation of the outliers. Another is that a subset of people can genuinely hear the difference, as discussed in #182 above.
 
That's a possible explanation of the outliers. Another is that a subset of people can genuinely hear the difference, as discussed in #182 above.

They could genuinely be "built differently", but statistically there will always be cheaters, lol.
Maybe they tried things out, like the high-pass filter I personally considered; I would have liked to try it out... but didn't.
But hooking up an analyser just to troll is too easy for nobody to have done it.
 
Did it again and got 11.
So I guess I am not fully guessing, but the difference is too small to be sure?
I guess I could do better with a high-pass filter, but I don't want to mess up the results.

I focused on the "s" of "myself" - hearing the SINAD difference seems impossible.
If you only did these two runs (the 1st with 12/16 correct, then the 2nd with 11/16) and we treat them as a single test with 23 correct out of 32 trials, we get a p-value of 1,0031%.

(binomial test calculator link)
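For anyone who'd rather reproduce this locally than use the calculator, a one-sided binomial test gives the same number - e.g. in Python with scipy:

```python
from scipy.stats import binomtest

# 23 correct out of 32 trials against chance (p = 0.5), one-sided
result = binomtest(23, n=32, p=0.5, alternative='greater')
print(f"{result.pvalue:.4%}")   # 1.0031%
```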

So while the probability of the result being caused by chance is relatively low (but not zero!), the number of incorrect trials is a testament to the audible differences being far from obvious. :D

Either way, good job and thanks for taking the test! :)
 
So while the probability of the result being caused by chance is relatively low (but not zero!), the number of incorrect trials is a testament to the audible differences being far from obvious. :D

That was intuitively my conclusion, too.
It also matches my impression while listening.

I think, though, that this difference might become a little(!) more obvious with material that is harsh, or on the boundary of harsh - a brightly mastered rock song with a lot of cymbals, for example.

Good job with the test.
 
I find this thread fascinating and an eye-opener!


Maybe somebody already asked... in the 350-plus tests taken, can we identify some users who consistently pick above 12 (or wherever you'd think the number of correct answers becomes statistically significant)?

Cheers,
 
I find this thread fascinating and an eye-opener!
Thanks, I'm very glad you found it interesting!

Maybe somebody already asked... in the 350-plus tests taken, can we identify some users who consistently pick above 12 (or wherever you'd think the number of correct answers becomes statistically significant)?
I can't identify users, so I can't really say for sure. I also can't be sure whether or not some (of the very few that did score well) used spectrum meters or similar to 'cheat'. There's unfortunately no way for me to control for that - it is a limitation of a remote/online test format. :)

However even so, as you can see, there are in general very few attempts that did well - even in this test where the two DACs measure very differently (there is even a significant frequency response difference between them). I.e. this is a test that should be relatively easy.
 
That might be due to timing. Your test coincided with the holiday season - many people are traveling or hosting relatives and probably haven't had a good opportunity to sit down and do such a test yet. Be patient.

OTOH, when I've posted actual files for people to listen to and choose without knowing, the participation levels have always been abysmal.
On the bright side, no matter what the outcome is, subjectivists will go on believing what they wish. So it all works out.
 
Since test results keep coming in occasionally here's another update. :)

These are the results of participants that took the online test via abxtests.com - we had a total of 516 completed attempts.
Note that here I'm saying 'attempts' instead of 'participants' - this is because a few participants reported they took the test more than once.
Correct | p-value P(X>=x) | How many participants scored?
      0 |        100,000% |    0
      1 |         99,998% |    0
      2 |         99,974% |    0
      3 |         99,791% |    2
      4 |         98,936% |   17
      5 |         96,159% |   31
      6 |         89,494% |   69
      7 |         77,275% |   89
      8 |         59,819% |  102
      9 |         40,181% |   93
     10 |         22,725% |   54
     11 |         10,506% |   31
     12 |          3,841% |   14
     13 |          1,064% |    8
     14 |          0,209% |    2
     15 |          0,026% |    1
     16 |          0,002% |    3

Note: p-value P(X>=x) has been calculated with this online calculator (n=16, p=0.5, q=0.5, K=<number of correct trials>) and cross-checked here.
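The same column can also be recomputed locally; a minimal sketch in Python with scipy:

```python
from scipy.stats import binom

n, p = 16, 0.5
for k in range(n + 1):
    # binom.sf(k - 1, n, p) == P(X >= k) for a pure-guessing attempt
    print(f"{k:>2} correct: P(X >= {k:>2}) = {binom.sf(k - 1, n, p):8.3%}")
```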

Pretty distribution graph:
[Image: distribution of scores across the 516 test attempts]

Looks like a normal distribution to me, suggesting random/chance selection - and consequently that the measured differences between the test files were inaudible to the vast majority of participants.
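To make the "chance selection" reading concrete, here is a small sketch comparing the observed counts with the binomial expectation for 516 guessing attempts:

```python
from scipy.stats import binom

# Observed number of attempts per score, from the table above (0..16 correct)
observed = [0, 0, 0, 2, 17, 31, 69, 89, 102, 93, 54, 31, 14, 8, 2, 1, 3]
n_attempts = sum(observed)  # 516

for k, obs in enumerate(observed):
    expected = n_attempts * binom.pmf(k, 16, 0.5)
    print(f"{k:>2} correct: observed {obs:>3}, expected by chance {expected:6.1f}")
```

The bulk of the histogram tracks the chance expectation closely (e.g. ~101 expected vs 102 observed at 8 correct), while the extreme right tail (14-16 correct) sits above it - the handful of outliers discussed earlier in the thread.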

As we see, out of the total of 516 attempts, 28 beat the lax <5% p-value criterion; of those 28, 14 were at least borderline for the stricter <1% criterion, and only 6 were well below it, scoring 14 or more out of 16 trials.
Note that one of the 16/16 results is not included in the result overview, as explained in post #141.

To put the above in percentages:
  • 5,4% of total test attempts beat the lax <5% p-value criterion
  • 2,7% of total test attempts are borderline for or better than the stricter 1% p-value criterion
  • 1,2% of total test attempts clearly beat the stricter 1% p-value criterion
In addition to the above, two participants reported that they also did the test in the foobar2000 ABX comparator: one got 40 correct out of 64 trials for a total p-value of 2,997% (beating the <5% p-value criterion, but not the stricter <1% criterion); the other reported they couldn't hear a clear difference, so gave up.

Here's a(nother) replay of closing words from my original overview post :p
In the end, I do hope this was an interesting exercise to those included. Hopefully one that also illustrates the importance of precise level matching and blind listening when doing comparisons of audio equipment.
 
Did any participants scoring 16 say what the 'tell' they were hearing was?
Not to my knowledge, no. One participant did admit to having used an analyzer to get the perfect score - so I've omitted that one result from the result overview.

Could be that others used similar methods to "cheat", or could be that some participants actually heard the difference easily (which I still think should be possible, given the clear differences in frequency response between the two audio samples). There's unfortunately no way to be sure, given the limited controls of online tests. I wrote about this previously as well - e.g. in post #183.
 
Did any participants scoring 16 say what the 'tell' they were hearing was?
If there are enough participants, it is statistically unlikely that no one scores perfectly by chance. So it's not about the "tell", it's about the stats.
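For scale, a quick sketch of what "enough participants" means here, treating every attempt as independent pure guessing:

```python
import math

p_perfect = 0.5 ** 16                      # 1/65536 per 16-trial attempt
for n in (516, 5000, 50000):
    p_at_least_one = 1 - (1 - p_perfect) ** n
    print(f"{n:>6} attempts: P(at least one perfect score) = {p_at_least_one:.1%}")

# Attempts needed for a 50% chance of at least one perfect score
print(math.ceil(math.log(0.5) / math.log(1 - p_perfect)))   # ~45426
```

So a perfect 16/16 by pure luck only becomes likely after tens of thousands of attempts.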
 