
Audio Blind Tests and Listener Training (video)

Well, what a fantastic video.

Kudos to Amir in one particular area before I go any further. In talking about his training he could easily have come across as very arrogant and conceited, but he was careful and patient enough to lay out his abilities without appearing so. So thanks for that, Amir.

Now, all of the content of the video, yes, fine, all correct I'm sure, well done. But wait. I know some here consider hi-fi to be a hobby in its own right, and might even want to sign up to courses to try and gain at least some of Amir's insights, and then go home to test themselves.

But we're not all like that. For some (I hope most), we just want to listen to our favourite music and for it to sound as good as possible. In most cases for as little money as possible. Music is our hobby, and hi-fi a means to an end.

So whether or not Amir can spot the differences using the methods explained, I'm not sure it actually scratches that itch.

I think what I'm saying is... if Amir can only tell a particular level of distortion is there by listening to a 0.5 second clip without the distortion several times, then to the same clip with the distortion several times in quick succession, then whilst it's excellent that there's someone out there who can do that, it's not particularly useful for those of us who just want to enjoy the music.

The differences we need to know about are the ones we may well hear in 'normal listening'.

So take a track that you know - let's say, for the sake of argument, The Four Seasons December 1963. And play it once with the distortion (or whatever), and once without. Does the one without the distortion sound better or not?

Because, if it doesn't, I'm not sure what being able to tell the difference between two 0.5 second clips of the same piece of music played repeatedly, and in quick succession, actually tells us.

Indeed, I'll flip that. I think the opposite is true. The fact that Amir can tell the difference by repeating a clip, but not by listening to the whole track, is an indication that the difference is too small to note. Too small to be basing a hi-fi purchase decision on.

When a difference is so small that it can be heard in (let's call it) artificial listening, but not normal listening, then THAT'S got to be our cut off point. If Amir's excellent and trained hearing can hear it in an artificial environment, but not a normal environment, then that's not a difference of any interest. I'm not going to buy that piece of kit.

Do I have that wrong?

I want to know where this cut off point is. When Amir can tell the difference when listening to snippets, and then identify it in normal listening, then that's got to be where a piece of equipment or a file becomes problematic.

If it's possible for Amir to hear an issue this way with my current amp or DAC, but not able to identify it on a new piece of kit, then that's when I'll consider upgrading.

But why would I want to upgrade if the only way Amir can tell a difference is if he cuts a 0.5 second snippet out of a piece of music and listens to it several times in a loop?
 
Do I have that wrong?

I think you’ve got it spot on.

Listening to a 0.5s snippet of music in a loop to try to hear a difference may be just another day at the office for Amir. For anyone else it’s the first sign of a mental health problem.

I can’t recall if he covered this but I would like to know if Amir could hear any difference between 24 bit and 16 bit and even mp3 on his main system under normal listening conditions. If he can’t, the rest of us can give up trying.
 
Amir - excellent and quite informative review with great tips on how to listen critically. On the first Foobar example was there a typo? Both sample one and two said 24 bits - was one of them supposed to be 16 bit? I couldn't see any difference in the two titles to see what was under test.
Thanks.
 
In a normal room, with normal music, under non-pathological conditions, I think most would be very hard pressed to tell a difference beyond that. Most well below.

That said, a CD has 96dB of range, so the base standard seems to be whether it is capable of giving you whatever there is to be had from a CD or whatever source. At some point, it is more about boundary case issues than veils being removed or hearing that triangle for the first time, unless you are starting with a really crap DAC.
This is where I disagree with most people here. I don't think something can be said to be subpar when there are very rarely going to be people who even notice the differences. Subpar, to me, is a really crap DAC, one I can hear without trying too much. An entry-level DVD player with audible issues is easy to find, so why don't we measure one and see what an audible difference measures like?
 
Amir - excellent and quite informative review with great tips on how to listen critically. On the first Foobar example was there a typo? Both sample one and two said 24 bits - was one of them supposed to be 16 bit? I couldn't see any difference in the two titles to see what was under test.
Thanks.
My pleasure. No, there is no typo. The 16-bit file was also presented as a 24-bit file so you could not cheat just by looking at file properties. The effective bit depth, however, was 16.
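For anyone wondering how a 16-bit file can end up "presented as" 24-bit, here is a minimal sketch of one way to do it; an illustration only, not necessarily how the actual test files were prepared. It assumes the numpy and soundfile Python packages, and the file names are placeholders. Padding each 16-bit sample with zero low-order bits produces a file whose header says 24-bit while its effective resolution stays at 16 bits.

Code:
# Sketch: repackage 16-bit PCM as a 24-bit file without adding resolution.
# Assumes the numpy and soundfile packages; file names are placeholders.
import numpy as np
import soundfile as sf

data, rate = sf.read("source_16bit.wav", dtype="int16")  # original 16-bit samples
full_scale = data.astype(np.int32) << 16                 # scale up to the 32-bit integer range
sf.write("presented_as_24bit.wav", full_scale, rate, subtype="PCM_24")
# Writing integer data into a PCM_24 file keeps the most significant 24 bits,
# so each stored sample is the original 16 bits followed by 8 zero bits:
# the file properties report 24-bit, but the effective bit depth (and the
# quantisation noise floor) is still that of 16-bit audio.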
 
Comments on confidence and p-values... not trying to be a stickler, a big part of it is thinking this through for myself. A few years ago, a mastering-engineer friend I was corresponding with on the subject of dither did an ABX test and felt it determined that he could hear something that I didn't think was hearable under normal circumstances (and he wasn't sure either, initially). He pointed out that his results were very close to p = 0.05, the standard for deciding that a result could not have been guessed, and that therefore it was highly likely he could in fact hear the difference. I said, "interesting, can you give it a few more tries and see what happens?" He said he didn't want to, as it might mess up his score. There were two things I took from that: people have a lot of confidence when they hit that value (even when he didn't, exactly, but was close), and... wtf does this significance level really mean anyway?

To get to my point, if I give you an ABX test to see if a difference can be heard, and you score high (p <= 0.05), my confidence should not be too high that you can hear the difference. If I have you take the test 100 times and you average a high score, then I should be very confident you can hear the difference. Or, if I give the test to 100 people and the average is a high score, I should be very confident that people, in general, can hear the difference.
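To put rough numbers on that (a sketch, assuming a standard forced-choice ABX where a pure guesser is right half the time per trial; the figures below are illustrations, not results from any of the tests discussed here): a single 8-of-10 run only just misses the usual p = 0.05 bar, while the same hit rate sustained over many more trials becomes overwhelming.

Code:
# One-sided binomial p-values for ABX scores, assuming a guesser is right
# 50% of the time per trial. Standard library only.
from math import comb

def p_value(hits: int, trials: int, p_guess: float = 0.5) -> float:
    """Probability of scoring `hits` or better by guessing alone."""
    return sum(comb(trials, k) * p_guess**k * (1 - p_guess)**(trials - k)
               for k in range(hits, trials + 1))

print(p_value(8, 10))    # ~0.055  - a single 8/10 run sits right at the borderline
print(p_value(10, 10))   # ~0.001  - a single 10/10 run is much stronger
print(p_value(80, 100))  # ~6e-10  - the same 80% hit rate kept up over 100 trials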

It's that latter case that we're almost always testing when we put out a public challenge: we offer the test to as many takers as we can get. But while such a challenge is put to a group, it is ultimately framed as a set of individual challenges. In other words, if I offer an ABX test to see if people can hear the difference between two files, and 100 people take the test, one or more are likely to get a high-confidence score. Those people will reply back that I have been proven wrong, that clearly there are people who can hear the difference. The flaw in that deduction is that if I were to secretly make both A and B the same sound file, odds are someone would still get a high score and the satisfaction of believing they have golden ears, when in fact it was luck.
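A quick way to see the size of that problem (a sketch under the same guessing assumption as above; the 9-of-10 pass criterion is just an example): even if every single taker is guessing, the chance that at least one of 100 of them posts a "high-confidence" score is large.

Code:
# How often "someone passes" when everyone is guessing.
# Assumption: each of 100 takers independently does a 10-trial ABX, and we
# count 9/10 or better as a pass (one-sided p ~ 0.011 for a pure guesser).
from math import comb

def tail(hits, trials, p=0.5):
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(hits, trials + 1))

p_single_pass = tail(9, 10)                      # ~0.011 for one guesser
p_someone_passes = 1 - (1 - p_single_pass) ** 100
print(p_someone_passes)                          # ~0.66: a lucky "golden ear" is more likely than not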

In other words, it's not profound that Amir gets an 8-of-10 to 10-of-10 score on a given test. It is profound that he can do it test after test (with something that's hearable; we can allow that he'll fail on un-hearable differences, because we're nice folks). (Now, maybe there are hundreds of attempts he's not telling us about... j/k :p)

As I mention in a related post (High Resolution Audio: Does It Matter? #615), I find that most public tests are framed wrong. For instance, if the question is whether people can hear the difference between 44.1k and 96k, and it is put to a random population, I'd expect the answer to be no. Still, the conclusion from that can't be "no one can hear the difference", just that people on average can't. If you wanted to know whether some people can hear it, that's a different test; you would probably want to go with serious audio people who claim they can hear it as test subjects. If the results show no, it still doesn't mean no one can hear it, just that these audiophiles in general can't. If you want to determine whether a particular person can hear it, you probably need a relatively large series of tests on that one person.
 
Hearing it is one thing. Noticing it during music is an entirely different animal and would not occur near the same detection level.
 
I dare to address the pink elephant in the room. Amir is „cheating“: what he is doing, e.g. listening to tiny artifacts while turning up the volume to 11 (Spinal Tap forever ;)), has absolutely nothing to do with evaluating the sound quality. He is right in proving the point that, given his extensive technical expertise, he can pass those tests. But he acts like a pest controller who swears he can kill all the rats in New York City, then drops a couple of nuclear bombs and tells the baffled audience: I told you I could do it! The way he passes those tests has no real-world implication. And IMHO that is quite obvious.
 
I dare to address the pink elephant in the room. Amir is „cheating“: what he is doing, e.g. listening to tiny artifacts while turning up the volume to 11 (Spinal Tap forever ;)), has absolutely nothing to do with evaluating the sound quality of music. He is right in proving the point that, given his extensive technical expertise, he can pass those tests. But he acts like a pest controller who swears he can kill all the rats in New York City, then drops a couple of nuclear bombs and tells the baffled audience: I told you I could do it! The way he passes those tests has no real-world implication. And IMHO that is quite obvious.

In the first place, Amir is NOT evaluating the quality of music, nor does he pretend to do so. He is evaluating the characteristics of software, systems and equipment. Then he tells us what he found out. We can make our own judgements from there.
To do that, a certain amount of overstress is involved. Want to evaluate a car engine? Over-rev it, and have a pro do things with it on a track that you would never do. Very common.
That's what testing is. There is no "cheating" involved.

Want to test a concrete culvert? Stress it to destruction; very common.
That's what testing is. There is no "cheating" involved.

Want to evaluate paint? Leave it in the worst possible environment for "X" number of years. If two samples look alike, hit each one with a wide-spectrum camera; the differences will show up.
That's what testing is. There is no "cheating" involved.

How would you like to drive around town in a car that went through government crash tests... by braking just before impact? That's what you would do if you were to get into a collision, but the "cheating" that governments do in destructive crash tests goes far beyond what you would do in your daily life.
That's what testing is. There is no "cheating" involved.

Testing extremes benefits median uses.

If the tiny differences that he presented to us were found only by instrumentation, I think you would not have said that there was an "elephant in the room." But he did it using the human factor; his brain and his experience.

I think that's what bothers you. Jim
 
In the first place, Amir is NOT evaluating the quality of music, nor does he pretend to do so. He is evaluating the characteristics of software, systems and equipment. Then he tells us what he found out. We can make our own judgements from there.
To do that, a certain amount of overstress is involved. Want to evaluate a car engine? Over-rev it, and have a pro do things with it on a track that you would never do. Very common.
That's what testing is. There is no "cheating" involved.

Want to test a concrete culvert? Stress it to destruction; very common.
That's what testing is. There is no "cheating" involved.

Want to evaluate paint? Leave it in the worst possible environment for "X" number of years. If two samples look alike, hit each one with a wide-spectrum camera; the differences will show up.
That's what testing is. There is no "cheating" involved.

How would you like to drive around town in a car that went through government crash tests... by braking just before impact? That's what you would do if you were to get into a collision, but the "cheating" that governments do in destructive crash tests goes far beyond what you would do in your daily life.
That's what testing is. There is no "cheating" involved.

Testing extremes benefits median uses.

If the tiny differences that he presented to us were found only by instrumentation, I think you would not have said that there was an "elephant in the room." But he did it using the human factor; his brain and his experience.

I think that's what bothers you. Jim
Sound quality is what I meant; I did edit my post. And I don't think that your examples apply here. There is a total difference between testing an engine or a paint, or extreme tests for an airplane, and sound quality.
 
I dare to address the pink elephant in the room. Amir is „cheating“: what he is doing, e.g. listening to tiny artifacts while turning up the volume to 11 (Spinal Tap forever ;)), has absolutely nothing to do with evaluating the sound quality of music.
In many tests I do not turn up the volume to hear the differences. It was the case in that one test (16 vs 24 bit) because it related to a fade-out to silence, and the difference by definition would be at the lowest volume.

I also don't try to advocate such successes when they don't make sense. For example, I tested a Topping DAC against Schiit Yggdrasil. I recorded very faint music and then amplified the output a ton. The difference was extremely clear between the two. I did not trumpet this test however because I could not justify it as being representative of real situations.

My definition of cheating is going outside the rules of the tests. In the 24-bit vs 16-bit test, no condition was put on listening volume, so what I did was perfectly fine. I am sure when Archimago created the test, he didn't think listening level would change the results. Maybe he does now. :)
 
Hey Amir, thank you for your reply. Would you say there is a meaningful audible difference between 24 and 16 bit? I, with my limited knowledge, would say: no! Thank you.

"My definition of cheating is going outside the rules of the tests." That is why it did put "cheating" in quotation marks.
 
Sound quality.... And I don't think that your examples apply here. There is a total difference between testing an engine or a paint, or extreme tests for an airplane, and sound quality.
The extreme tests you describe are for things that are overengineered for safety. Overengineering is very expensive, and to overengineer for sq is stupid, unless you expect it to be used that way. But no one ever wants to experience pain while listening to music, so they never do that.

Unless you mean production, which is a different animal, where the sound is amplified and manipulated in ways where any noise present could end up amplified. That's why pro gear has balanced cables and our gear doesn't.
 
There’s a difference between (a) is it possible that I might be able to hear a difference in specific, non-real-world instances, and (b) is it worth me replacing my (for example) £250 DAC that I bought 6 months ago with a new £250 DAC, if the difference is that small?
If you can't just as easily put the money in an ashtray and burn it, once you have a great DAC you are set.
For your system to be so transparent that another DAC would be a major factor would mean you aren't worried about £250.
 
Incorrect. Testing finds and defines limitations. It's a protocol and a mindset. To what you apply those protocols and that mindset, and for what purpose, is totally beside the point. Jim
For me testing is a means to an end. If you listen to tiny fragments of music and turn up the volume to extreme levels, then for me that has nothing to do with evaluating the sq of 16 vs 24.
 
I tend to agree with Chrisxxx26 here. A test of audibility has to be confined to a realistic situation.

Otherwise:
  • we can all distinguish 16 bit vs 24 bit by playing a LSB signal and amplifying the output by 100 dB
  • we can all distinguish 32 bit vs 24 bit by playing a LSB signal and amplifying the output by 150 dB
  • you get the picture
The whole point of discussing audibility in audio programme playback is to know whether, when playing audio programme that peaks in the range 0 to -6 dB (ie not recorded incompetently), at any level up to Very Friggin Loud (there is probably a loudness SPL level that optimises audibility and anything higher causes the ears (and mind) to protect themselves and lose sensitivity to fine details), Tech A vs Tech B can be distinguished.
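For a rough sense of the scale involved in those bullet points, here is a back-of-the-envelope sketch. It assumes the usual ~6 dB of dynamic range per bit and a peak playback level of about 85 dB SPL; both numbers are assumptions for illustration, not measurements from the tests discussed in this thread.

Code:
# Back-of-the-envelope numbers behind the "just amplify it" argument.
# Assumptions: ~6.02 dB of dynamic range per bit (plain quantisation, no
# noise shaping) and a peak playback level of about 85 dB SPL.
def noise_floor_dbfs(bits: int) -> float:
    return -6.02 * bits  # idealised quantisation noise floor relative to full scale

peak_spl = 85.0  # assumed peak listening level, dB SPL
for bits in (16, 24):
    floor = noise_floor_dbfs(bits)
    print(f"{bits}-bit: floor {floor:.0f} dBFS -> about {peak_spl + floor:.0f} dB SPL "
          f"at normal volume, {peak_spl + floor + 100:.0f} dB SPL with 100 dB of extra gain")
# 16-bit: roughly -96 dBFS, i.e. about -11 dB SPL at normal volume (below any
#         room's ambient noise), but around +89 dB SPL with 100 dB of gain.
# 24-bit: roughly -144 dBFS; even 100 dB of extra gain only lifts it to about
#         the level of a very quiet room, and no real analog chain has that
#         much clean spare gain anyway.

Which is the point of the bullets above: the difference only becomes audible once the playback condition itself stops being realistic.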

cheers
 
20 years ago this was done:
(I assume today's equivalents, e.g. AAC at HQ, are superior?)

„Reminiscences this time: our comparative listening took place exclusively in the publisher's studio room, whose damping, reflection and resonance behaviour corresponds to an 'audiophile' living room; some readers will remember the room from the days when the magazine HIFI-Vision was published by us. At that time, the ceiling was hung with diffusors (sand-filled plastic tubes), and additional damping elements on the walls and a bookshelf filled with 'jagged' material ensured dry acoustics. However, the listening conditions found at HIFI-Vision magazine back then could not be completely reconstructed: instead of hi-fi magazines on the shelf, telephone books from the publisher's programme had to suffice as acoustically effective set dressing. Our ex-colleagues may forgive this inaccuracy... We used a pair of B&W Nautilus 803 as our top-class system, fed by a Marantz CD14 CD player and PM14 integrated amplifier. With the Straightwire Pro cables and accessories used, this combination comes to around DM 30,000 - an amount that few hi-fi enthusiasts are likely to have to spare for their hobby. The Nautilus speakers from this fine English manufacturer are also popular in studios and mastering rooms because of their well-balanced, analytical and largely uncoloured sound. Furthermore, Axel G (no relation to the editor-in-chief Detlef G, who also took part out of competition) from Sennheiser loaned us the Orpheus electrostatic reference headphones, including the matching tube amplifier - unfortunately only for the duration of the test, because the noble small-series product was, at 20,000 DM, the most expensive component of our equipment.

From a random selection of demanding pieces of music, the listeners were each played a chosen passage of about one minute in length, roughly four minutes of listening per piece in total: first as a reference from CD, then in scrambled order - encoded at 128 and 256 kbit/s or again from CD. For these three recordings, the task was to determine the correct origin of each presentation and note it on a questionnaire. For a correctly identified 128 kbit/s encoding, one point was awarded per piece of music, and likewise for a correctly identified CD recording. For three correctly identified versions, the respondent received three points, but none at all if the 256 kbit/s recording was correctly marked while the opposite qualities, CD and 128 kbit/s, were interchanged. A total of 51 points could be scored, and due to the unequal weighting the statistical chance mean was 14.1 points. Anyone who achieved more than this number of points had actually heard differences in quality. In order to exclude sound shifts caused by different D/A converters in CD and MP3 players, we had exported the MP3 test files, encoded with MusicMatch in Windows version 4.4 in joint stereo mode, on a Power Mac G3 via Apple's QuickTime Player into AIFF format, and then burned them in scrambled order together with the ripped CD audio tracks onto a common audio CD.

Listening Festival

Already at the break after the first half hour of concentrated listening, some of the test persons wanted to give up: 'A lottery game', was a comment heard several times. Many test listeners were surprised at how good an MP3 recording can sound through the Marantz player's excellent D/A converter. They talked shop about phase relationships, the influence of (not perfect) room acoustics and personal listening habits, argued about the importance of good cables or praised the superiority of analogue recordings on vinyl - which unfortunately could not be heard due to a lack of comparable recordings.

During the break and after the official part of the event, some of the participants had the opportunity to examine and classify individual pieces with the Orpheus reference headphones. For direct A/B comparison, the participants were allowed to jump back and forth between the individual versions at will, which for understandable reasons had to be avoided during the joint listening.

Award ceremony

The unofficial winner with 26 points was our 'reference listener' Gernot, who had to admit after more than an hour of intensive listening: 'That was hard. It almost seemed to me that some of the 256-kBit recordings sounded a bit rounder and more pleasing than the originals from CD. You couldn't let that put you off.' In fact, CD was marked by mistake instead of 256 kBit/s remarkably often.

Among the invited readers, Mirko, a student of electronics development, who, according to his own statement in his application, 'can guess the sound of an audio circuit just by looking at it', scored 22 points. Under the given conditions - unfamiliar acoustics, stress of success, unfamiliar equipment, suboptimal listening position - a quite respectable score, which earned him the first prize of our competition: 1000 Marks in cash.

We were a bit surprised when we heard about his musical preferences: 'Actually, I cheated a bit on my application. I have classical piano training, but as an active recreational musician I tend to do punk rock.' For the test, of course, he practised hard with different MP3 qualities: Most recently, he had achieved a hit rate of 90 per cent with 128-kBit encodings, despite a handicap: 'Since an explosion accident, I can only hear up to 8 kHz on the left, and until recently I had persistent tinnitus on the right. Nevertheless, I get the typical flanging effects of the MP3 filter banks, and even better than my friends - maybe even because of my hearing damage.'

There's something to this: the psychoacoustic model of MP3 coding assumes a normal-hearing person. Someone who only perceives frequencies up to 8 kHz will not hear a bright cymbal or triangle beat - but they will hear the control noise of the filters in the lower frequency ranges, namely when the parts that are actually hidden by the high-frequency sound are subtracted. Steep notch filters, such as those used in MP3 decoders, actually produce a flanging or 'jet' effect when tuned quickly.

So it is not the perfect ear, but the ear that deviates greatly from normal hearing that seems to be particularly sensitive to the MP3 artefacts. The psychoacoustic masking effects underlying the algorithm - the alarm clock continues to tick even when it rings - also apply to the control noises that occur, which are normally masked by the useful signal. If the latter falls by the wayside, for example because of a hearing impairment, the artefacts are perceived first.

Two together

With 20 points, Jochen and Tom from Nuremberg took joint second place, followed by Martin from Hamburg. He himself owns the big B&W Nautilus 801 and has spent 40,000 Marks on his stereo system 'out of a deep love for music and without falling prey to any ideology'. Dipl.-Ing. Tom is a hearing aid developer, works on audio signal processing algorithms and is used to 'paying close attention to processing artefacts and sound differences during intensive sound tests'. Jochen already had the opportunity to get to know Advanced Audio Coding and other MP3 successors at Fraunhofer IIS in Erlangen.

Statistically speaking

The collected data is not enough for watertight statistics, but it certainly provides interesting insights. We wanted to find out for which pieces of music the uncertainty about the source was particularly high and for which titles most listeners were right. From the mere addition of the points scored by all participants for a particular title, one already gets an indication of whether the recording is rather good or bad for the assessment of MP3 quality (see table of scores).

Classical recordings are by no means always advantageous: for some pieces, the scores were even consistently wrong. For example, far more than half of our testers liked the Arabian Dance from Edvard Grieg's 'Peer Gynt' best in the 128-kBit encoding; obviously, the compression corrected slight weaknesses in the recording, such as a roughness in the woodwinds. Chic's 'Jusagroove', on the other hand, a very dynamic, dense funk track, was correctly classified by most listeners.

To get to the bottom of this phenomenon, we broke down the test results further: of course, we were interested in the reasons for such difficulties. Did the testers have problems distinguishing between high-quality MP3 (256 kbit/s) and low-quality MP3 (128 kbit/s), or did MP3 sound better to their ears than CD?

To find out, we used a slightly different evaluation method. According to the common prejudices against MP3, one would expect MP3/128 to sound the worst, MP3/256 to occupy a middle position and the audio CD to deliver the best sound. Thus, any sound sample judged to be 128 kbit/s MP3 received one point; the guess 'MP3 at 256 kbit/s' resulted in two points, and the assessment 'CD' received three points - regardless of what sound quality was actually played. If a test listener could not recognise any difference within a piece, we rated all three sound samples as CD quality, with three points each.

Now we added up the points for each piece in each sound quality across all test listeners. If all 14 people had always guessed correctly, the picture would have been the same for every piece: 14 points for MP3 at 128 kbit/s, 28 points for MP3/256, 42 points for the CD. In fact, the picture was quite different: especially for the tracks where our test listeners were particularly often off the mark, the assessed quality of the CD was consistently lower than that of the MP3 samples.

The biggest surprise, however, came when we added up the points collected for all tracks and obtained a value for each of MP3 at 128 kbit/s, MP3 at 256 kbit/s and CD: MP3/256 and CD achieved exactly the same score of 501 across all tracks and test listeners; MP3/128 was significantly lower at 439. For those interested in statistics: the values 439 and 501 are statistically significantly different with a probability of error of one percent (in scientific studies, one is often satisfied with five percent); there is of course no difference between CD and MP3 at 256 kbit/s, which scored exactly the same.

Conclusion

In plain language, this means: Our music-trained test listeners were able to distinguish the poorer MP3 quality (128 kBit/s) quite accurately from the other two audio samples; however, no difference was discernible between MP3 with 256 kBit/s and the original CD on average across all tracks: The testers rated MP3/256 as CD quality just as often as the CD itself.

The fact that some 128 kbit/s recordings were consistently rated better than the CD originals by the competent listeners (and also by the 'best' among them) amazed even the editor involved, who, as he confesses to his shame (and without taking part in the evaluation), had achieved only 15 points. In conclusion, there is no single genre of music that lends itself particularly well or particularly poorly to compression. Evidently it is differences in recording technique that take their revenge later when bit rates are too low.

This article will not end the discussion about the sense and nonsense of MP3 compression. Hi-fi fans with brand and status consciousness will never listen to MP3s, no matter how many tests and studies demonstrate that the perceived sound is equivalent. Doubters ('All sissies, I would have heard that for sure') may reach for encoders and CD burners and subject themselves to the 'Pepsi test'.“
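One detail in that write-up that is easy to sanity-check is the quoted chance baseline of 14.1 points out of 51. A minimal sketch, under two assumptions the article doesn't state explicitly: that there were 17 pieces (51 / 3), and that a purely guessing listener assigns the three labels (128 kbit/s, 256 kbit/s, CD) as a random permutation per piece, scored as described above.

Code:
# Sanity check of the article's chance baseline of 14.1 points out of 51.
# Assumptions: 17 pieces (51 / 3), and a guesser labels the three samples
# with a uniformly random permutation of 128 kbit/s, 256 kbit/s and CD.
from itertools import permutations

TRUTH = ("128", "256", "CD")  # actual identities of the three presentations

def score(guess):
    if guess == TRUTH:
        return 3                      # all three versions identified correctly
    pts = 0
    if guess[0] == "128":
        pts += 1                      # 128 kbit/s correctly identified
    if guess[2] == "CD":
        pts += 1                      # CD correctly identified
    return pts                        # "256 right but CD/128 swapped" scores 0

per_piece = sum(score(g) for g in permutations(TRUTH)) / 6
print(per_piece, 17 * per_piece)      # 0.83..., about 14.2 - in line with the quoted 14.1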
 
Good write-up. Worth noting that MP3 encoders have improved in the 20 years since those tests were done, so MP3 encoded today at the same bitrate would be harder to distinguish from the source than it was in that write-up.

Also, MP3 has been surpassed in audio terms; in fact MP3 is officially obsolete. Modern codecs like AAC at a constant bitrate of 128 or 256 kbps would be much harder to distinguish from uncompressed source music than MP3 was in that write-up. That said, nobody today would use a constant bitrate at a medium-low rate like 128 kbps; variable bitrate AAC would improve the audio quality again (i.e. make it even more difficult to detect than constant bitrate) at the same target file size as a 128 kbps constant bitrate MP3 from 2000.

cheers
 
we can all distinguish 16 bit vs 24 bit by playing a LSB signal and amplifying the output by 100 dB
No such effort was made whatsoever. I simply turned up the analog volume control from normal listening level to a bit higher. This was using my laptop. No way it had 10 dB of headroom let alone 100 dB. Indeed, I explained that I discarded an artificial test where I used software amplification (which I think was 30 or 60 dB). So please don't make up scenarios like this.

we can all distinguish 32 bit vs 24 bit by playing a LSB signal and amplifying the output by 150 dB
If you mean software amplification, then you are violating the test by playing a different set of files than what was provided. So that would be a clear cheat. If you mean you have the ability to provide analog amplification of 150 dB, then you must be living in a different universe than the rest of us!

However, this is a real scenario to the extent that you do have that much amplification. Any noise you hear then is an audible artifact and means the format you are playing is not transparent.

I suggest you actually try to see if you can pass the test as I did. That would train you in what to listen for, which was the purpose of the video.
 