Your point is valid for audio reviewers who claim the differences are obvious, but I'm talking about a different case. Sometimes when I listen for differences, they are so subtle that I'm not sure whether they are actually there. I want to know whether they are real or just my imagination. Also, companies building audio gear and engineers designing codecs need to test near the limits of perception. That requires a lot of trials, so I would imagine they either use techniques to mitigate listener fatigue or have a mathematically valid way to aggregate shorter sessions done on different days. Either would be interesting to share here.
PS: for example, consider the following set of tests, each conducted on a different day:
Test 1: 7 trials, 5 correct, 77.34% confidence
Test 2: 8 trials, 5 correct, 63.67% confidence
Test 3: 6 trials, 4 correct, 65.63% confidence
Test 4: 9 trials, 6 correct, 74.61% confidence
None reached 95% confidence. Can we simply sum them? If so, that's 30 trials with 20 correct, which works out to 95.06% confidence.
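As a sanity check on those figures, here is a minimal sketch (Python, assuming scipy is available) that reproduces each confidence value as a one-sided binomial test against chance, i.e. 1 minus the probability of scoring at least that many correct by pure guessing:

```python
# Reproduce the per-session and pooled confidence figures above.
# "Confidence" = 1 - P(at least k correct out of n by guessing, p = 0.5).
from scipy.stats import binomtest

sessions = [(7, 5), (8, 5), (6, 4), (9, 6)]  # (trials, correct) per day

for n, k in sessions:
    p = binomtest(k, n, p=0.5, alternative="greater").pvalue
    print(f"{n} trials, {k} correct: {100 * (1 - p):.2f}% confidence")

# Pool all sessions into a single binomial test:
n_total = sum(n for n, _ in sessions)  # 30
k_total = sum(k for _, k in sessions)  # 20
p = binomtest(k_total, n_total, p=0.5, alternative="greater").pvalue
print(f"Pooled: {n_total} trials, {k_total} correct: {100 * (1 - p):.2f}% confidence")
```

One caveat I'd flag: pooling like this is clean only if the total number of trials is fixed in advance. If you keep adding sessions until the pooled number happens to cross 95%, the real false-positive rate is higher than 5%.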
Intuitively, if you do only slightly better than random guessing on one short test, it might just be luck. But if you do slightly better than random guessing consistently, session after session, you can still reach high confidence with enough total trials.
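To make that intuition concrete, here is a small sketch (same assumptions as above) for a hypothetical listener who is correct exactly 60% of the time, only slightly above the 50% chance rate, at increasing trial counts:

```python
# Confidence grows with trial count even when per-trial accuracy is
# fixed only slightly above chance (hypothetical 60% correct rate).
from scipy.stats import binomtest

for n in (10, 30, 100, 300, 1000):
    k = round(0.6 * n)  # exactly 60% correct at every size
    p = binomtest(k, n, p=0.5, alternative="greater").pvalue
    print(f"{n:5d} trials, {k:4d} correct: {100 * (1 - p):6.2f}% confidence")
```

The listener never gets any better; the evidence just accumulates. At 10 trials, 6 correct is barely above a coin flip (about 62% confidence), while at 100 trials, 60 correct is already past 97%.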