Well beyond the formula, a good statistician would like to see 30 or so samples to feel good about filling out a distribution. They'll work with 15, but fewer than this is iffy, and suggestive at best.
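For a rough sense of what those sample counts mean in an ABX setting, here is a minimal sketch of the exact one-sided binomial calculation, assuming every trial is independent and a guessing listener is right half the time. The function names are made up for illustration and are not taken from any ABX tool.

```python
from math import comb


def abx_p_value(correct, trials, chance=0.5):
    """Probability of scoring at least `correct` out of `trials` by pure guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


def min_correct_for_significance(trials, alpha=0.05):
    """Smallest score whose guessing probability is <= alpha.

    Returns trials + 1 when even a perfect score cannot reach alpha.
    """
    for correct in range(trials + 1):
        if abx_p_value(correct, trials) <= alpha:
            return correct
    return trials + 1


if __name__ == "__main__":
    for n in (15, 30):
        print(n, "trials: need at least", min_correct_for_significance(n), "correct")
```

Under these assumptions a listener needs 12 of 15 or 20 of 30 correct to get below the conventional 5% level, which is one way to see why very short runs are only suggestive.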
I don't think we're far apart. I am a proponent of blind testing. It's an essential tool. I've participated in them and I've written ABX testing software. But every test has sensitivity limits, and it's important to differentiate precision from recall.... You build up confidence towards one outcome and that is that. It may not be 100% but at some point, it is likely to be correct. Is this the flu or just a cold? Is it chest pain or a heart attack?
... It is possible that the difference exists but unlikely. They are liable to be better than most people so if they can't hear the difference, what chance does the average Joe have? Shades of gray.
It's not even clear from that quote that they are performing these listening tests in a controlled manner.
I'm pretty sure they're not. They are saying that if they give it a quick listen and it sounds as expected, then it's OK. It's the same way that any professional simply decides on stuff without conducting a survey or controlled blind experiment first. I should think that 99.9% of the man-made world is built that way!
Here's why I think this is more than theoretical: we know that switching delays in ABX tests, even less than a second, reduce sensitivity. This proves there are real differences that are obscured by even a small time delay; even short-term memory is imperfect. Intellectual curiosity makes me wonder what other differences might be lost?
What is lost, is lost to an expert listener. He is already so much better than others that to them, it makes no difference what that extra bit is. When I find differences between MP3 at 320 kbps and original, and it is hard, then I know no one is going to detect what I am hearing let alone what I am missing.
I tend to think that if a double blind audio test requires training to discern differences that are inaudible to most, and even then it takes a lot of effort to identify them, it kind of indicates that even where differences exist they are not that significant. I've seen audiophiles break into a cold sweat with the effort of identifying subtle differences and then pretend there is a night and day difference. Well, no: if it were a night and day difference, people wouldn't be so stressed out about whether or not they were identifying the different equipment.
I ripped all my CD collection to MP3, then years later repeated the exercise to FLAC. In many cases I cannot discern a difference, in others I can discern a difference but it takes effort to do so and is subtle, and if I'm listening to music to enjoy the music (perish the thought of it) then it makes no difference. What I do get from FLAC is a psychological boost in knowing that it is lossless and nominally "better" which has an effect in itself I think, and these days memory is cheap but if judging strictly by audio quality alone to be quite honest I'd be happy enough with 320k MP3.
Recently I got my invitation to the Qobuz US beta (192 kbps tier) and have been trying to compare it with my Spotify 320 stream. I'm finding myself in the same position as you. It's very hard to do in any case, since it's nearly impossible to find a way to level match and rapidly switch between the two. As it stands now I'm finding it hard to justify a monthly fee that is at least twice as expensive, even for the CD tier. Besides that, and at least for the music I listen to, Spotify has the largest catalog around.
Visually, we can perceive two different things simultaneously: hold them right next to each other and look at both.
It is more like a rapid A/B comparison as the viewer foveates each item alternately. In addition, it is probably the same for tactile information: with two objects palpated simultaneously in two hands, one probably cannot attend to both at once.
If you want to compare 192 kbps mp3 to 320 kbps mp3, you can just use a lossless recording, transcode it to both formats and do a DBT with this: https://www.foobar2000.org/components/view/foo_abx
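A minimal sketch of the transcoding step, assuming ffmpeg is installed and was built with the libmp3lame encoder. The helper names and the `_192` / `_320` output naming are hypothetical, chosen only for this example; the resulting files could then be loaded into an ABX comparator such as the foobar2000 component linked above.

```python
import subprocess


def build_transcode_cmd(src, dst, bitrate_kbps):
    """Return the ffmpeg argument list for an MP3 encode at a fixed target bitrate."""
    return [
        "ffmpeg",
        "-i", src,                   # lossless source, e.g. a FLAC rip
        "-codec:a", "libmp3lame",    # assumes this encoder is available in the build
        "-b:a", f"{bitrate_kbps}k",  # audio bitrate, e.g. 192k or 320k
        dst,
    ]


def transcode_pair(src, bitrates=(192, 320)):
    """Encode one source file at each bitrate and return the output paths."""
    stem = src.rsplit(".", 1)[0]
    outputs = []
    for kbps in bitrates:
        dst = f"{stem}_{kbps}.mp3"  # hypothetical naming scheme
        subprocess.run(build_transcode_cmd(src, dst, kbps), check=True)
        outputs.append(dst)
    return outputs
```

Both encodes come from the same lossless source, so any difference heard between them is down to the bitrate rather than to different masterings, which is the point of the suggestion above.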
I was with you till you said it tells you nothing.
The issues you state are true but we have tools to deal with them. ...
This seems to confuse test sensitivity with listener sensitivity. Imagine 2 people X and Y with equal hearing acuity. But X has better short-term memory. This is plausible since we know people vary widely in memory performance, both short term and long term. In this hypothetical scenario X and Y both hear the same differences, but X has a more accurate and detailed short-term memory to compare it with, thus out-performs Y on the DBT.
Possibly! Though if so, the switching is more rapid than it can be with audio testing which requires absolute sequential separation. And we know through testing that even small time delays have a measurable impact on test sensitivity.
Yes, but switching can be instantaneous, to a few milliseconds, possibly faster, so one may not even perceive that switching has happened unless there's a change in sound quality. When doing AA AB BA BB testing, and recording whether the change is same/different, that sort of short switching time should not be any problem for anyone, however short their audio memory. They can switch back and forth as often as they like, as they only have to identify whether the two are same/different. Only once it has been established that a difference exists is there any point in going further with preferences or trying to identify what the differences are.
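The AA AB BA BB same/different procedure described here can be sketched roughly as follows. This is an illustration only, assuming each of the four pair types is presented equally often in random order; the function names and the hit / false-alarm scoring are assumptions for the example, not a description of any particular test software.

```python
import random

PAIRS = ("AA", "AB", "BA", "BB")


def make_schedule(repeats, seed=None):
    """Return a shuffled list of pair labels with each pair type used `repeats` times."""
    rng = random.Random(seed)
    schedule = list(PAIRS) * repeats
    rng.shuffle(schedule)
    return schedule


def score(schedule, said_different):
    """Compare the listener's same/different answers against the hidden schedule.

    `said_different` holds one boolean per trial: True means "I heard a change".
    """
    if len(schedule) != len(said_different):
        raise ValueError("one answer is required per trial")
    hits = false_alarms = n_different = 0
    for pair, answer in zip(schedule, said_different):
        really_different = pair[0] != pair[1]
        n_different += really_different
        if answer and really_different:
            hits += 1
        elif answer:
            false_alarms += 1
    n_same = len(schedule) - n_different
    return {
        "hit_rate": hits / n_different if n_different else 0.0,
        "false_alarm_rate": false_alarms / n_same if n_same else 0.0,
    }
```

Including the AA and BB trials is what keeps the scoring honest: a listener who answers "different" every time gets a perfect hit rate but also a 100% false-alarm rate.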
Having read that document last night, I see your point. Even though a negative result from a blind test is inconclusive (absence of evidence is not evidence of absence), you can derive something useful from it when testing a group of people. You can measure which listeners have greater acuity than others. And you can use that to filter your listeners and improve the test sensitivity.
However, you still cannot detect (let alone correct for) false negatives. And the evidence we do have suggests that these false negatives do exist. This evidence is that inserting even small (less than 1 second) switching delays reduces test sensitivity. So even short-term memory is imperfect, to a measurable extent. Yet when we perform a blind test, even with instantaneous switching we still rely on memory because we can't simultaneously hear A and B. We are always comparing one with our recent memory of the other.
The practical takeaways are:
We know that audio tests have limited sensitivity, because they rely on memory which is time-sensitive even for fast switching.
Whatever threshold of audibility we measure in blind tests is not the threshold of inherent hearing acuity; it is the threshold of test sensitivity.
We can reasonably assume that inherent hearing acuity is an even lower threshold (to assume otherwise implies perfect short-term memory, which is implausible).
We don't know how much lower that threshold is because we can't detect the false negatives.
So:
Equipment makers using blind tests should add a safety factor to the minimum thresholds detected in tests. How much to add is up to their discretion.
People who express a preference between A and B, but can't differentiate them in a DBT might be victims of expectation bias or other psychological factors, or they might also be hearing a real difference that is lower than the test threshold but higher than the acuity threshold.
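To put a number on the false-negative point, here is a simplified sketch. It assumes a hypothetical listener who really does hear a difference but only gets each trial right with a fixed probability (0.6 in the usage below, a made-up figure), with independent trials and a 5% pass criterion. It does not model memory or switching delay, and the function names are invented for this example.

```python
from math import comb


def tail_prob(trials, at_least, p):
    """Probability of at least `at_least` correct answers when each is right with probability p."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(at_least, trials + 1)
    )


def criterion(trials, alpha=0.05):
    """Smallest passing score whose probability under guessing (p = 0.5) is <= alpha."""
    for k in range(trials + 1):
        if tail_prob(trials, k, 0.5) <= alpha:
            return k
    return None  # no score can reach alpha with this few trials


def false_negative_rate(p_true, trials, alpha=0.05):
    """Chance that a listener with true per-trial accuracy p_true fails the test."""
    k = criterion(trials, alpha)
    if k is None:
        return 1.0
    return 1.0 - tail_prob(trials, k, p_true)


if __name__ == "__main__":
    for n in (16, 100):
        print(n, "trials: false-negative rate", round(false_negative_rate(0.6, n), 3))
```

Under these assumptions, a listener who is right 60% of the time fails a 16-trial test about 83% of the time. Adding trials shrinks that figure, but it never reaches zero, which is the sense in which a null result reflects the test's sensitivity rather than the listener's acuity.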
.. When we've gotten this close to thresholds they are already well past what someone could pick up on in casual listening without a reference. Or polluted sighted long term listening. That is already a margin of safety vs the normal use of the audio equipment for listening to music.