
Audio Blind Testing - You Are Doing It Wrong! (Video)

pseudoid

Master Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
5,161
Likes
3,501
Location
33.6 -117.9
... EEs or have any sort of formal engineering background -- and they are therefore both unacquainted with what it takes to produce a credible professional paper and uninterested in test results that don't serve their marketing interests...
I qualify, but I smelled the wrong kind of fish:
[screenshot attachment: Snag_5e0d1e3.png]

4τ = 98.2% and 5τ = 99.3%.
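
(For reference, and assuming - since the screenshot isn't reproduced here - that these figures refer to first-order exponential settling: after n time constants the response has reached $1 - e^{-n}$ of its final value.)

$$1 - e^{-4} \approx 0.982 \;(98.2\%), \qquad 1 - e^{-5} \approx 0.993 \;(99.3\%)$$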
 

respice finem

Major Contributor
Joined
Feb 1, 2021
Messages
1,867
Likes
3,777
Double-Blind Tests?
:oops: In the 21st Century, us mere mortals are lucky if we can just audition a speaker...
Haha, exactly... Though auditioning speakers and comparing them at the dealer's makes limited sense, because they're almost guaranteed to sound different in a different room.
 

voodooless

Grand Contributor
Forum Donor
Joined
Jun 16, 2020
Messages
10,371
Likes
18,280
Location
Netherlands
Some doubling down on this topic from Audioholics:

Man, Sean Olive really needs to do something about his room acoustics; he sounds like he's in an echo chamber (and while he's at it, he should buy a decent microphone ;) )
 
Last edited:

Sgt. Ear Ache

Major Contributor
Joined
Jun 18, 2019
Messages
1,895
Likes
4,160
Location
Winnipeg Canada
But how is it useful to prove a difference if it doesn't say anything about the superiority of one over the other?


Before we can establish the superiority of one over the other, we have to establish that there is a difference at all (which can of course be established via measurements, but we're discussing subjective impressions here). Most of the blind testing in audio isn't about actually proving superiority. It's about testing the veracity of a subjective claim that may or may not be based on anything real.
 
Last edited:

Sgt. Ear Ache

Major Contributor
Joined
Jun 18, 2019
Messages
1,895
Likes
4,160
Location
Winnipeg Canada
Haha, exactly... Though auditioning speakers and comparing them at the dealer's makes limited sense, because they're almost guaranteed to sound different in a different room.

Speakers are sort of a singular issue. Nobody anywhere, including here, is claiming that speakers don't sound different from one another. I have a bunch of different bluetooth speakers and they all sound distinctly different, such that I could probably identify which is which in a blind test. However, in every single case those speakers also have measurably different sonic characteristics. There's no real mystery there. They are doing different things to the signal, and as a result they sound different from one another.

However, even in the case of speakers, if you have 2 sets of speakers that operate within a certain performance region (as opposed to, say, a toy vs a good speaker such as presented in magicscreen's hilarious little post earlier in this thread), and you do a bit of work positioning them and EQing them for the room, and you then do a carefully volume-matched blind listening test, it would be very difficult to identify which was which - even though they would still probably have some measurable differences. That just goes to show how much more sensitive the measurement equipment is than our own ears.

Amps (often) and dacs (even more often) do not have those measurable differences. If they do have them then something is probably wrong with one of them. They also don't have the susceptibility to room effects. They are - other than the specific task they are designed to accomplish - straight line signal processors. On a fundamental level, why would anyone want a dac that "sounds different" from another dac? It's not a DAC/DSP...it's a DAC. Any 2 dacs that do a measurably good job of accomplishing the task that a dac is supposed to do will sound indistinguishable from one another. So, if someone does a blind test and is reliably able to distinguish one dac from another, the assumption has to be that one or the other dac is a crappy dac. Measurements will clearly display which one is the crappy one. There won't be any mystery about it. If on the other hand someone has 2 dacs that do a measurably good job and claims he can hear a difference between them, then a further step needs to be taken before there's any merit to that claim - he needs to show that he can identify that difference without knowing which dac he's listening to.

There's also of course an issue of "verifiability" here, right? Anyone can claim on the internet that they've done the blind testing and they've "proven" that this $5000 dac sounds better than this $200 dac even though the two things measure the same. Maybe it's true. Or perhaps they are lying. Or possibly they did a blind test but it wasn't really done correctly. Who knows? The only really meaningful test would be something conducted by a neutral party of some sort, similar to Harman's speaker research. However, if I were a dac manufacturer selling a boutique dac for thousands of dollars that I was confident sounded notably better than cheaper dacs, why would I not happily arrange blind test sessions comparing my sweet-sounding item with a bunch of cheap (but measurably good) stuff, so that I could sell many, many units by proving it actually sounds better via some unmeasurable magic?
 
Last edited:

Spocko

Major Contributor
Forum Donor
Joined
Sep 27, 2019
Messages
1,621
Likes
3,000
Location
Southern California
Another great, much-needed master course on controlled audio testing, Amir.

I totally agree with fast A/B switching, something I started doing many years ago after I realized how short auditory memory is. I laugh when I read a reviewer mention that he doesn't remember ever hearing a particular piece of music sound as good as it did through some special, usually very expensive, gear at some audio venue.

Another great point in your video is listening with the same reference audio tracks one has been using for many years. I've been using the same audio tracks for over 15 years. It always amuses me when I see professional audio reviewers who list their source music du jour, changing from one month to the next; without a valid reference point for observation, the review becomes a totally subjective exercise.
People's moods change their expectations and attitude, and we know how strongly correlated mood is to decisions in general. An analysis of criminal sentencing just before and just after lunch is quite sobering: longer sentences were rendered just before lunch and shorter ones after; hungry judges were more critical than well-fed judges (link to article "Lunchtime Leniency"). So when doing A/B tests, when Amir says all variables must be addressed, this also includes the mood of the reviewer, which I believe may have more of an influence than anything else. Would your review be different if your wife suddenly left you after yelling at you, "You love these stupid A/B tests more than you love me! I'm done, I'll be leaving you for the exotic cable salesman who has a Ferrari"?
 

abdo123

Master Contributor
Forum Donor
Joined
Nov 15, 2020
Messages
7,444
Likes
7,954
Location
Brussels, Belgium
Consumers should not be concerned with blind listening, IMO; we don't listen blind, ever. So what's the point?

Consumers should spend their time educating themselves so they don't waste their money on unnecessary or inferior products to begin with.
 

Sgt. Ear Ache

Major Contributor
Joined
Jun 18, 2019
Messages
1,895
Likes
4,160
Location
Winnipeg Canada
Consumers should not be concerned with blind listening, IMO; we don't listen blind, ever. So what's the point?

Consumers should spend their time educating themselves so they don't waste their money on unnecessary or inferior products to begin with.


...aaaaaaaaaand we determine which are the inferior products by how?
 

Sgt. Ear Ache

Major Contributor
Joined
Jun 18, 2019
Messages
1,895
Likes
4,160
Location
Winnipeg Canada
People's moods change their expectations and attitude, and we know how strongly correlated mood is to decisions in general. An analysis of criminal sentencing just before and just after lunch is quite sobering: longer sentences were rendered just before lunch and shorter ones after; hungry judges were more critical than well-fed judges (link to article "Lunchtime Leniency"). So when doing A/B tests, when Amir says all variables must be addressed, this also includes the mood of the reviewer, which I believe may have more of an influence than anything else. Would your review be different if your wife suddenly left you after yelling at you, "You love these stupid A/B tests more than you love me! I'm done, I'll be leaving you for the exotic cable salesman who has a Ferrari"?

Sure, but if blind A/B tests are subject to the whims of mood, subjective listening tests would have to be even more so, right?
 

abdo123

Master Contributor
Forum Donor
Joined
Nov 15, 2020
Messages
7,444
Likes
7,954
Location
Brussels, Belgium
...aaaaaaaaaand we determine which are the inferior products by how?

With the results of the blind listening that reviewers and manufacturers (should) do. The manufacturer should convince you to buy the product; you shouldn't have to put in that effort yourself.

KEF, Genelec, and Neumann all provide full spinorama data for most of their speakers. I don't need blind listening to know that their speakers are engineering-driven.
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,476
Likes
4,094
Location
Pacific Northwest
Great introduction to testing.
At about 25 mins, he mentions he'd like to be 99% or more accurate. Of course we all would! However, some differences are so subtle that we can't always score that high. It would be helpful to add discussion of precision/specificity versus recall/sensitivity, or false positives vs. false negatives.

For example, 2 different ways to achieve 95% confidence:
An obvious difference, get the first 5 right and you're done (96.88% confidence).
For a subtle difference that you can't get right every time, get 20 of 30 (95.06% confidence).

Problem is, detecting subtle differences close to the threshold of perception requires so many trials, the test may produce false negatives due to listener fatigue.
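
For anyone who wants to reproduce those numbers, here is a minimal sketch in Python (assuming, as the figures above imply, a one-sided binomial test against chance with p = 0.5 per trial; the trials_needed helper and its 60% hit-rate example are my own illustration of the fatigue point, not something from the video):

```python
from math import sqrt
from scipy.stats import binomtest, norm

def abx_confidence(correct: int, trials: int) -> float:
    """1 - P(scoring this well or better by pure guessing), one-sided binomial."""
    return 1.0 - binomtest(correct, trials, p=0.5, alternative="greater").pvalue

print(f"{abx_confidence(5, 5):.2%}")    # 96.88% - obvious difference, first 5 right
print(f"{abx_confidence(20, 30):.2%}")  # 95.06% - subtle difference, 20 of 30

def trials_needed(p1: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size to detect a true hit rate p1 > 0.5."""
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
    n = ((z_a * 0.5 + z_b * sqrt(p1 * (1 - p1))) / (p1 - 0.5)) ** 2
    return int(n) + 1

# A listener who is genuinely right 60% of the time needs on the order of
# 150 trials for 95% confidence at 80% power - hence the fatigue problem.
print(trials_needed(0.60))  # ~153
```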
 

Sgt. Ear Ache

Major Contributor
Joined
Jun 18, 2019
Messages
1,895
Likes
4,160
Location
Winnipeg Canada
With the results of the blind listening that reviewers and manufacturers (should) do. The manufacturer should convince you to buy the product; you shouldn't have to put in that effort yourself.

KEF, Genelec, and Neumann all provide full spinorama data for most of their speakers. I don't need blind listening to know that their speakers are engineering-driven.

Well yeah, I'm generally fine with measurements too. I know, for instance, that any 2 dacs that measure well are going to work for me. I don't think anyone is saying we all need to do blind tests all the time to make our buying decisions. That's not really what this is about. I think for individuals (or for youtube audio reviewers), blind testing comes into play when we think we are hearing something that we probably aren't actually hearing. If I think I hear some special sauce in a $5000 dac that I don't hear in a (very similar-measuring) $200 dac, before I go online and post my video I might want to do a bit of blind testing just to confirm, even to myself, that it's not just a figment of my imagination.
 

Sgt. Ear Ache

Major Contributor
Joined
Jun 18, 2019
Messages
1,895
Likes
4,160
Location
Winnipeg Canada
Great introduction to testing.
At about 25 mins, he mentions he'd like to be 99% or more accurate. Of course we all would! However, some differences are so subtle that we can't always score that high. It would be helpful to add discussion of precision/specificity versus recall/sensitivity, or false positives vs. false negatives.

For example, 2 different ways to achieve 95% confidence:
An obvious difference, get the first 5 right and you're done (96.88% confidence).
For a subtle difference that you can't get right every time, get 20 of 30 (95.06% confidence).

Problem is, detecting subtle differences close to the threshold of perception requires so many trials, the test may produce false negatives due to listener fatigue.

Uh-huh, but if it's so close to the threshold of perception that it requires trial after trial to try and identify in a blind test, how is it even in the realm of possibility that it's a difference notable in the sort of comparisons the average subjective review undertakes? Like...how could something so subtle be reliably identified in a comparison between this new dac I just bought and the one I had before that I last listened to a minute ago or a couple days or a couple weeks ago?
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,476
Likes
4,094
Location
Pacific Northwest
Uh-huh, but if it's so close to the threshold of perception that it requires trial after trial to try and identify in a blind test, how is it even in the realm of possibility that it's a difference notable in the sort of comparisons the average subjective review undertakes? Like...how could something so subtle be reliably identified in a comparison between this new dac I just bought and the one I had before that I last listened to a minute ago or a couple days or a couple weeks ago?
For audio reviewers who claim the differences are obvious, your point is valid. But I'm talking about a different case. Sometimes when I listen to differences, they are so subtle I'm not sure if they are actually there. I want to know whether they are real, or just my imagination. Also, companies building audio gear, engineers designing codecs, need to test near the limits of perception. This requires a lot of trials, so I would imagine they use techniques to mitigate listener fatigue, or have a mathematically valid way to aggregate shorter trials done on different days. Either would be interesting to share here.

PS: for example, consider the following set of tests, each conducted on different days:
Test 1: 7 trials, 5 correct, 77.34% confidence
Test 2: 8 trials, 5 correct, 63.67% confidence
Test 3: 6 trials, 4 correct, 65.63% confidence
Test 4: 9 trials, 6 correct, 74.61% confidence
None reached 95% confidence. Can we simply sum them? If so, it's 30 trials, 20 correct, which is 95.06% confidence.
Intuitively, if you do only slightly better than random guessing on a short test, it might just be luck. But if you do only slightly better than random guessing every time, consistently, repeatedly, then you can still reach high confidence with enough trials.
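
For what it's worth, summing is statistically legitimate here: independent trials with the same underlying hit rate simply form one larger binomial sample, provided every session is counted. A quick sketch (same one-sided binomial test as in the earlier snippet) reproduces both the per-day and the pooled figures:

```python
from scipy.stats import binomtest

# (correct, trials) per day, from the post above
sessions = [(5, 7), (5, 8), (4, 6), (6, 9)]

for correct, trials in sessions:
    conf = 1 - binomtest(correct, trials, p=0.5, alternative="greater").pvalue
    print(f"{correct}/{trials}: {conf:.2%}")  # ~77.3%, ~63.7%, ~65.6%, ~74.6%

# Pooling: independent trials with the same hit rate form one larger
# binomial sample - but only if every session is included, good days and bad.
k = sum(c for c, _ in sessions)  # 20
n = sum(t for _, t in sessions)  # 30
conf = 1 - binomtest(k, n, p=0.5, alternative="greater").pvalue
print(f"pooled {k}/{n}: {conf:.2%}")  # 95.06%
```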
 
Last edited:

Sgt. Ear Ache

Major Contributor
Joined
Jun 18, 2019
Messages
1,895
Likes
4,160
Location
Winnipeg Canada
For audio reviewers who claim the differences are obvious, your point is valid. But I'm talking about a different case. Sometimes when I listen to differences, they are so subtle I'm not sure if they are actually there. I want to know whether they are real, or just my imagination. Also, companies building audio gear, engineers designing codecs, need to test near the limits of perception. This requires a lot of trials, so I would imagine they use techniques to mitigate listener fatigue, or have a mathematically valid way to aggregate shorter trials done on different days. Either would be interesting to share here.

I suppose. I don't really know why those sorts of differences would even matter - especially given we can measure differences to a far more detailed degree than what we can actually hear. To me, something that is so close to the edge of perception in a careful, focused listening test situation just isn't going to have any general, real-world listening significance.
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,476
Likes
4,094
Location
Pacific Northwest
... To me, something that is so close to the edge of perception in a careful, focused listening test situation just isn't going to have any general, real-world listening significance.
To normal people, of course you're right. But we're talking about audiophiles here, people who really care about, or at least are curious about, the most subtle, barely perceptible differences.
 

Jim Shaw

Addicted to Fun and Learning
Forum Donor
Joined
Mar 16, 2021
Messages
616
Likes
1,159
Location
North central USA
I occasionally follow PS Paul's infomercials on YT. I fully expect to hear this toward the end:

"But here's our special offer just for the next 4 hours: Buy one PSAudio Blunderbus 2000 at the regular price and we'll include a Bombshell 2000 at no extra cost. Call us now at 1-800-555-1234. Operators are standing by to take your order. But you must call that number in the next 4 hours. 1-800-555-1234. Have your credit or debit card ready..."

;)
 

Sgt. Ear Ache

Major Contributor
Joined
Jun 18, 2019
Messages
1,895
Likes
4,160
Location
Winnipeg Canada
PS: for example, consider the following set of tests, each conducted on different days:
Test 1: 7 trials, 5 correct, 77.34% confidence
Test 2: 8 trials, 5 correct, 63.67% confidence
Test 3: 6 trials, 4 correct, 65.63% confidence
Test 4: 9 trials, 6 correct, 74.61% confidence
None reached 95% confidence. Can we simply sum them? If so, it's 30 trials, 20 correct, which is 95.06% confidence.
Intuitively, if you do only slightly better than random guessing on a short test, it might just be luck. But if you do only slightly better than random guessing every time, consistently, repeatedly, then you can still reach high confidence with enough trials.


Yeah, I'd personally find that a fairly convincing result in a good blind test. But if we ran, say, ten tests over ten days and there were a few days with 9 trials and only 4 correct, it would be less so.
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,476
Likes
4,094
Location
Pacific Northwest
Yeah, I'd personally find that a fairly convincing result in a good blind test. But if we ran, say, ten tests over ten days and there were a few days with 9 trials and only 4 correct, it would be less so.
Of course - if summing like this is mathematically correct (it seems so, though I'm not sure), then one must sum them ALL. No cherry picking!

Yep, and who also have an entirely unrealistic notion about what their ears are actually capable of. lol
Haha, yes. But if they're reading this thread, there is hope :)
 
Last edited: