
Audio Blind Tests and Listener Training (video)

There is also a reference to an article I wrote a few years ago, based on research published at AES, on the usability of long-term listening in blind tests: https://www.audiosciencereview.com/forum/index.php?threads/aes-paper-digest-sensitivity-and-reliability-of-abx-blind-testing.18


Please let me know what you think of these lecture heavy videos.
Hi @amirm, thanks for the video! About long-term memory: years ago I had an Asus Xonar sound card with a Cirrus Logic CS4398 DAC chip in several setups. For the last 5 years I didn't use it at all and switched to ES9018/ES9038-based DACs. But a few weeks ago my friend invited me to check out his setup; the speakers and receiver were new to me and the DAC was hidden somewhere, but after a minute I got a very strong nostalgic feeling and I knew it was the signature of my old sound card. I was sure there was a Cirrus Logic DAC somewhere in his setup, and that was true! Do you know what kind of flaw could give such a strong signature to those CS4398 DACs? IMD in some frequency region, jitter? It's obviously not the frequency response; I believe it's extremely difficult to spoil that in a DAC and op-amp buffer, right?
 
My work is in the design, analysis and interpretation of clinical trials in some very specific areas. I would like to add a few comments to give perspective to what @amirm states regarding "training" and statistics.
First, a few words on statistics. I do have to caution that the hypothesis you test makes a difference. Testing for a difference (you believe the two are different and try to show it) is not the same as testing for equivalence (you assume they are the same and try to show you cannot tell them apart). It is a rather subtle distinction, but it requires quite different sampling. Not a serious issue for this group, but you should exercise some caution when invoking "statistics".
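To make that distinction concrete in the ABX context: an exact binomial test can tell you whether a score is unlikely under pure guessing, but a non-significant result does not demonstrate that two devices sound the same. A minimal stdlib-only sketch (the trial counts are illustrative, not from any specific test):

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided exact binomial p-value: the probability of getting
    at least `correct` answers out of `trials` by guessing (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# 12/16 correct is unlikely under guessing -- evidence of a difference...
print(round(abx_p_value(12, 16), 4))  # → 0.0384
# ...but 8/16 merely fails to show a difference; it does NOT prove
# equivalence (that requires a differently designed, larger study).
print(round(abx_p_value(8, 16), 4))   # → 0.5982
```

The asymmetry is the point: the first result licenses a positive claim, while the second licenses no claim at all.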
The second has to do with training of "testers". I will give a few examples. In rheumatoid arthritis you need to measure joint pain and inflammation in patients. But this can be a very subjective matter, both for the patient and for the observer making the assessment. In trials, you would have an assessor who is blinded to the treatment, different from the clinician following up on the patient. You also have all the assessors trained BEFORE the study starts on how to reach the same inflammation and tenderness (pain) score for each patient and joint. You bring in a large group of patients and every assessor sees every patient. Then the head trialist explains how to make sure they reach the same score. Training is intense because the "joint count" is a key measure of efficacy in this disease.
The second example is from the "reading" of colonoscopies in studies of Crohn's disease. Even though there is a specific protocol on how to read and grade them, doctors are still unreliable in trials. Most old studies had "local" readings and doctors had an incentive to enter patients that needed to have a minimum score in the scopes. This resulted in high placebo responses in colonoscopies in trials. As this was not biologically likely, a friend of mine took one trial and reread every scope blindly, not knowing if the scope was before or after treatment. What happened was that the blinded scores at entry were much lower than the "local readings". But the scores at the end did not vary much. In currently designed trials, you have a central blinded reader to enter patients (to eliminate local trialist incentive) and then scopes are read at the end of the study without knowing the order. And these biases happen in a very sophisticated environment with highly trained professionals. So, training and limiting subjectivity is essential at every level.
My last example is from a virology study. There, the doctors and patients were open to treatment, but we only took patients that had specific criteria at entry and compared the lab results, which were blinded (the machine had no "idea" who they were) and the virology results were valid. You can't "fix" the lab (except in sports, but that is a digression).
The point is that in many scientific endeavors, the very intense training of assessors or participants is essential to getting reliable results. What @amirm explained is the right way to do this work.
 
I have a bit of a disagreement with Amir's methods as he explained them in the Archimago 24 vs 16 bit tests. If you find the special spot and then turn up the volume to hear it, I consider that unfair augmentation. It is useful for finding whether something is audible under any circumstances, but once you find that spot, could you find it at normal listening levels? Usually not. No one is going to listen to a bit of music, eagerly await the fade to silence at the track's end, and jack the volume up to max. So at normal listening levels you could not detect a difference.

Even Amir says you could say he cheated.

In my opinion one of the key things he doesn't emphasize enough is echoic memory. In my experience it isn't beyond 10 seconds. So if you are switching between two things you have to keep it to 5 seconds each max. And keeping it to 2 seconds is better. There are lots of tests you can pass if you pick the right 2 second section to compare which you'll never pass if you listen for even 30 seconds each.

So then you get into the conundrum, that if it takes longer than echoic memory it likely matters not at all in normally listening to music. And boy can some of the distortions be really large and you'll not be able to hear it if you are not allowed to chop it up into segments smaller than 30 seconds.

Still, don't misunderstand me: I'm not really being critical of Amir's video. It is an excellent video. He hits all the right high points.
 
Sorry, but I don't follow. I get that this is a science forum with a lot of technical expertise, but isn't it quite obvious that there is a huge difference between medical and musical issues?

Yes, the diehard objectivists are wrong if they claim that nobody can hear a difference, as Amir has proven. But nobody listens to music that way, and people with that kind of training are rarer than moon dust. So in medicine there are of course real-world implications, but here there are none, besides proving that for some rare people who listen in a very specific way that nobody would use to listen to their favorite music, it is possible to pass a test.
 
With variable quality in music systems, variable hearing ability, and variable music, one is drawing gray lines all the time in audio. Big, fat, wide gray lines around the performance envelope of human hearing.

So you have trained listeners listening in an intense and unnatural way. If they hear no difference, that is about as close as you can get to assuring no one will hear a difference. Or no one will hear a difference with music in the normal way people listen. Same with listening to test tones versus using real music. Test tones are more revealing.

My personal opinion is that if one wanted to test oneself at home, a reasonable approach similar to normal listening is using 30-second snippets of music. That is well past the echoic memory span. No one has any evidence that listening longer helps one hear more; in fact, just the reverse. Thirty seconds is short enough that you don't tire out or eat up huge amounts of time. If people get near-random results this way, chances are any differences being compared are either too small to hear or, if occasionally audible, so minor they won't impact your musical enjoyment. If something is detected with high consistency this way, it is likely a large enough impairment that in most cases it is easily measured. People are more influenced by minor volume differences and wideband frequency response differences than they think they are. And far less influenced by even rather large amounts of distortion than they think they are.
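On the point about minor volume differences: it is commonly recommended to level-match sources to within roughly 0.1 dB before comparing, since small level offsets are easily mistaken for quality differences. A rough sketch of checking the RMS offset between two clips (the synthetic sine data here is purely illustrative; with real files you would feed in the decoded samples):

```python
import math

def rms_dbfs(samples):
    """RMS level of a block of float samples (in [-1, 1]) in dBFS."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms)

# Two "versions" of the same clip, one 4% louder in amplitude:
a = [0.50 * math.sin(2 * math.pi * 440 * t / 48000) for t in range(48000)]
b = [0.52 * math.sin(2 * math.pi * 440 * t / 48000) for t in range(48000)]
print(round(rms_dbfs(b) - rms_dbfs(a), 2))  # → 0.34 (dB mismatch)
```

A mismatch like that 0.34 dB is small enough to go unnoticed as "volume" yet large enough to bias a sighted comparison toward the louder source.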

Using this method however, people are prone to convince themselves they hear a difference where none exists. So much so they disbelieve the test results. So one either gets over the hump of believing the results or not. Plus one is so easily influenced by unexpected things. Which is the reason you commonly have someone claim success in single blind testing, and it all disappears with actual double blind testing.

As for medical testing, it isn't always so black and white either. Just look at the various Covid vaccines and how well they work, if they work permanently or not, or if you will need periodic boosters. All stuff yet to be determined. It has been determined they work well enough to be worth getting. One of them doesn't work as well keeping you from getting the disease it seems, but it works well enough you don't die or end up in the hospital. So a gray area there too.
 
...you get into the conundrum, that if it takes longer than echoic memory it likely matters not at all in normally listening to music.

...My personal opinion is if one wanted to test oneself over something at home a reasonable approach similar to normal listening is using 30 second snippets of music. It is well past the echoic memory time. ...

I think the within-echoic-memory test of 5-10sec is actually more relevant than you argue.

A "detectable difference" threshold, discovered by 5-10sec repetitions, might not be relevant to home listening, but I think a "preference" threshold is.

That's why I really like preference tests. I get the impression that, even if it took 5-10sec repetitions to maximise the reliability of the preference score, you are actually enjoying it more, and that extra enjoyment probably exists every second you are listening to that gear.

cheers
 
Highly enjoyable video, well presented with great examples. This is the most enjoyable audio site that I visit, and certainly the most educational. It particularly stood out that his "training" was supported by a number of scientists and engineers who formed a feedback network to perfect the training, while the trainee supported the scientists and engineers in perfecting the products.
@amirm - when you listen for enjoyment, particularly under less than optimum conditions, do you find your training detracts from the enjoyment? This happens in many areas, wines for example...
Please keep up these types of articles in addition to the great reviews. Thanks!
 
You have to have a detectable difference before a preference is even possible. I actually don't dwell on preference as much as I once did. In fact, I don't know that I can have any preference except with transducers (speakers and microphones). Everything else is pretty much transparent.
 
I trained myself using a much more exaggerated sample that either he had provided or I made myself (I forget which). I selected where that would occur in content and then listened to samples with much less of it. This is standard practice as I explained in the video. It shines light on where a distortion is audible, and where it is not and familiarizes you with what the distortion sounds like. Otherwise you are searching a 3 minute song for a 1 second artifact that you don't even know what it sounds like.

Listener training this way is highly encouraged in standards for listening tests, for example. Here is the bible of such standards, ITU-R BS.1116:

4.1 Familiarization or training phase
Prior to formal grading, subjects must be allowed to become thoroughly familiar with the test facilities, the test environment, the grading process, the grading scales and the methods of their use. Subjects should also become thoroughly familiar with the artefacts under study.


And please note that undergoing such actual ear training is *entirely different from* being 'trained' by virtue of being a 'discerning listener' (aka a golden ear) or having 'SOTA gear'. These are two tenets of high-end lore.

Nor would ABX success after training prove that the difference was heard before training.
 
So you have trained listeners listening in an intense and unnatural way. If they hear no difference, that is about as close as you can get to assuring no one will hear a difference.

My point is: if nobody, not even trained experts like Amir, can hear a difference in an average listening situation, then that is all we need. I think even the most technically minded person would agree that listening to little snippets of music and cranking the volume up to extreme levels to listen for tiny artefacts is not average.

What Amir is doing is like: a guy has a perfectly clean apartment done by a professional cleaning service. Then Amir comes with a microscope and a blacklight and proves that the apartment is indeed not perfectly clean. He is 100% right but it has no real-world implication.

For me, Amir's video proves that the objectivists are something like 99% right, and unlike in other scientific fields it doesn't matter that a black swan like Amir exists.
 
But it highlights a few things for me:

It was enlightening to see exactly how contrived this testing can be, like cranking up the volume 1000% at the itty-bitty end of a fade-out tail and putting those 5 seconds on repeat.

That sets the -115 dB limit Amir uses into perspective. It really is impossible (in this context; there is probably a black swan somewhere) for listeners in any contrived setup to pass.

The practical limit may be a big, fuzzy gray zone between -115 dB and -60 dB, but it depends on so much: the content, the situation, the kind of anomaly, the person doing the test, etc. It gets fuzzy.

When we then get back to tests of electronics, this helps a lot: devices with this kind of -115 dB performance really are transparent to all human beings, even under the most absurdly contrived circumstances. It really puts the "impossible" in there.

So when someone claims magical differences that do not show up at all, or show up at -150 dB, we have as close to a hard limit as we can get. For electronics we can make blanket statements like "this is inaudible".

It also says a lot about how great a masker music is.
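The arithmetic behind why such a floor is safe is simple: an artifact that far below full scale is reproduced at (peak playback SPL - artifact level) dB SPL, which lands below the nominal 0 dB SPL threshold of hearing at any sane listening level. A trivial sketch (the 110 dB SPL peak is an assumed, deliberately brutal worst case):

```python
def artifact_spl(peak_spl_db: float, artifact_dbfs: float) -> float:
    """SPL at which an artifact sitting `artifact_dbfs` below full scale
    is reproduced, given the peak playback level in dB SPL."""
    return peak_spl_db + artifact_dbfs

# Even at a 110 dB SPL peak, a -115 dBFS artifact plays back at -5 dB SPL,
# below the nominal threshold of hearing -- before masking is even considered.
print(artifact_spl(110, -115))  # → -5
```

Masking by the music itself only pushes the effective audibility limit further away from -115 dB, which is the point about music being a great masker.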
 
I think the aim of those blind tests is getting it 100% right at any expense; that's why Amir uses different tricks there. Still, it doesn't mean we can't enjoy the benefits of superior content in real life. At least enjoying possession :) of those 24/192 high-res files, vinyl, or master tapes.
 
Thank you Amir for the video. I started the video thinking: "Puff! Another video from Amir. And it is 44 minutes long! What was he thinking?" But after 44 engaging minutes, I can tell you that this is the best video in the series for me. Really, really interesting. It is a shame that you had to pause the video because of the phone call you received. You were going to say something interesting, but when you resumed, the thought was gone.

Anyway, thank you for taking the time to produce these videos. Not all are great, but the ones that are to me are worth the time I spend watching them.
 
I think the aim of those blind tests is getting it 100% right at any expense; that's why Amir uses different tricks there.

That's what I found a bit disappointing.

It's one thing to be trained to recognize compression artifacts that distinguish (say) MP3 from lossless. It's quite another thing to "distinguish" 16 from 24 bit audio by cranking up the volume in a "silent" passage till you can hear the quantization noise. If you did that "in real life", you'd blow your speakers (and your eardrums) the moment the music came back on.

That's a good way to "pass" the test, but it tells us nothing about the audibility of 16 vs 24 bit in any real-world scenario. In fact, I guarantee that if you retained control of the volume knob, there's no way @amirm would succeed in distinguishing the two.
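For context on the 16 vs 24 bit point: the dynamic range of ideal quantization follows the familiar 6.02N + 1.76 dB rule, and you can check it empirically by quantizing a sine. A stdlib-only sketch (the 997 Hz tone and 48 kHz rate are arbitrary choices; real tests would use dithered quantization, which this deliberately omits):

```python
import math

def theoretical_snr_db(bits: int) -> float:
    """SNR of an ideal full-scale sine quantized to `bits` (6.02*N + 1.76 dB)."""
    return 6.02 * bits + 1.76

def measured_snr_db(bits: int, n: int = 48000) -> float:
    """Quantize one second of a full-scale 997 Hz sine at 48 kHz (undithered)
    and measure the signal-to-error power ratio in dB."""
    step = 2.0 / (2 ** bits)  # quantizer step over a [-1, 1] range
    sig = [math.sin(2 * math.pi * 997 * t / n) for t in range(n)]
    err = [s - step * round(s / step) for s in sig]
    p_sig = sum(s * s for s in sig) / n
    p_err = sum(e * e for e in err) / n
    return 10 * math.log10(p_sig / p_err)

print(round(theoretical_snr_db(16), 1))  # → 98.1
print(round(theoretical_snr_db(24), 1))  # → 146.2
# measured_snr_db(16) lands within a fraction of a dB of the 98.1 figure
```

Either way, the 16-bit noise floor sits roughly 98 dB below peak, which is why hearing it requires exactly the kind of volume-cranking during silence described above.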
 

I am new here, but I have been into audio, and especially headphones, for more than 30 years now. There are so many BS/snake-oil forums and reviewers out there that it is just ridiculous, and for me in 2021 it is more a psychological than an engineering/technical phenomenon. ASR in general is, for me, the exception to the BS/SO rule. That's why I find it so strange in Amir's video that he emphasizes that he did pass those tests and explains what his background is, but IMHO steers away from the (for me) logical conclusions.

A) Feel free to collect and enjoy 24-bit but don't believe there is an "audiophile" reason to do so
B) High-bitrate lossy codecs (AAC, MP3 320 kbps) are so close to CD quality that it only makes sense for a tiny, tiny minority of "audiophiles" not to take advantage of lossy codecs (more music on your device)

It seems to me audiophiles have a tendency to eat reason for breakfast. And if measurements do not take into account their impact on the real world, a quote from a very well-known Englishman comes to mind: "Though this be madness, yet there is method in it."

But of course "As you like it." ;)
 
Quite. And IIRC Amir was called out about such highly contrived 'successes' on hydrogenaudio years ago. Possibly by yours truly.
 
.....That's a good way to "pass" the test, but it tells us nothing about the audibility of 16 vs 24 bit in any real-world scenario......

You are absolutely right... but that was not the purpose of Amir's story. Archimago's test of the audibility of 16 vs 24 bit was finished, and the results were already in, when Amir took it. Your question had already been answered: people cannot tell the difference between 16 and 24 bit under normal circumstances.

The goal of Amir's story was not to disprove Archimago's results, but to illustrate the circumstances in which blind testing is most able to discern tiny differences. These circumstances would include:
  • knowing exactly what kind of anomalies you are looking for,
  • knowing when these anomalies are most likely to be audible,
  • adjusting the volume and choosing the samples to make them most audible,
  • using very short listening intervals, rapidly repeated.
This is useful information to have. How to make blind testing effective matters, whether you are attempting to determine the artifacts in various codecs (which was part of Amir's job), or whether you are trying to prove the audibility of a fancy power cord. If Amir was trying to prove how much more acute his hearing is than the rest of us, he would not have revealed how he did it.

Audiophiles who insist you can only hear differences when you listen over hours or weeks are mistaken; that is when you are least likely to hear objective differences. Audiophiles who insist blind tests are too stressful to show objective results are also mistaken. Companies like Dolby and dbx who work on codecs do it all the time.
 