Limitations of blind testing procedures

Jakob1863 · Feb 21, 2017

oivavoi said:
Good comment. The interesting and challenging question, I think, is in the areas where there are known objective differences in sound or performance, but which nevertheless often get negative results in blind tests. And tests about preferences. For example lossy vs lossless, high lossy vs lower lossy, good amps vs bad amps, etc. Or many reflections vs few reflections, and so on. How useful are blind tests here? Should we be guided by statistical means in preference testing, and/or buy gear which doesn't measure well but which often can't be distinguished from gear that measures well in ABX tests? That's the thing I've been wondering about.

Btw: do you have any links to studies on reflections with opposing conclusions?

Controlled listening tests (I still don't like the label "blind test", as it relates to only one bias mechanism) are just a tool that helps in sorting things out. Independent and dependent variables are defined, and in a perfectly controlled test the result would show the true relationship between them. In reality there is no perfectly controlled test (one of the golden rules says "the more rigorous the control, the less the practical importance of the results"), so some confounders influence the results, making it increasingly difficult to ensure that the independent variable is the only real input.
At that point there is another golden rule: "block out what you can and randomize what you can not block out".
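
As a rough illustration of that rule, here is a minimal Python sketch. It assumes each listener is treated as a block (every listener hears every condition equally often) and that the presentation order inside each block is what gets randomized. The function name, listener names, and condition labels are made up for the example and do not come from any particular test standard.

```python
import random

def blocked_schedule(listeners, conditions, repeats, seed=None):
    """Return {listener: [condition, ...]} with each listener as one block.

    Blocking: every listener hears every condition `repeats` times.
    Randomizing: the order inside each block is shuffled, so order and
    fatigue effects are spread out instead of favouring one condition.
    """
    rng = random.Random(seed)
    schedule = {}
    for listener in listeners:
        trials = list(conditions) * repeats
        rng.shuffle(trials)
        schedule[listener] = trials
    return schedule

# Hypothetical example: two listeners, two conditions, four repeats each.
plan = blocked_schedule(["listener_1", "listener_2"], ["A", "B"], repeats=4, seed=1)
for name, order in plan.items():
    print(name, order)
```

The seed is only there to make a schedule reproducible; in a real session the order would be hidden from the listener.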

But before any consideration about protocols starts, one has to decide what hypothesis/question should be tested and whether any generalization of the results should be possible. It helps a lot to be as clear as possible in the verbalization of the hypothesis. Imo that point has quite often been underestimated in audio tests (those concerned with overall differences; the psychophysical tests were traditionally concerned with only one parameter).

So, as a tool, controlled listening tests can help in preference decisions, but there is no magic in them, which means one has to ensure that the test itself does not have an impact on the result. That is sometimes a bit difficult to achieve, as it depends on the EUT and the listener, so it might need an extended training time under the specific test conditions.
Especially if overall preference is tested - at that point I beg to differ with Blumlein88's description, as the "is saltier" question is called a directional test and usually tests for a difference, not a preference, while otoh a preference test is quite often non-directional - because preference is based on a multidimensional perception.
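
To make the directional vs non-directional distinction concrete, here is a small Python sketch. It assumes a simple paired comparison analysed with an exact binomial test against chance (p = 0.5); the function names are illustrative only. A directional question ("which one is saltier?") has one correct answer, so only one tail counts. A preference question has no correct answer, so a strong majority in either direction counts and both tails are used.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def one_sided_p(k, n):
    """Directional test: chance probability of k or more 'correct' answers."""
    return sum(binom_pmf(i, n) for i in range(k, n + 1))

def two_sided_p(k, n):
    """Non-directional test: a split at least this lopsided in either direction."""
    extreme = max(k, n - k)
    tail = sum(binom_pmf(i, n) for i in range(extreme, n + 1))
    return min(1.0, 2 * tail)

# Hypothetical result: 15 of 20 answers go one way.
print(round(one_sided_p(15, 20), 4))
print(round(two_sided_p(15, 20), 4))
```

For the same 15-of-20 split the one-sided value is about 0.021 and the two-sided value about 0.041, so the non-directional question needs a more lopsided result to reach the same significance level.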

fas42 · Feb 21, 2017

Everyone has their own perspective, but in my audio world the word "prefer" is non-existent. A system either functions well enough so there are no obvious, audible problems - or it doesn't. Comparing two audibly flawed systems, or variations of the one system is an exercise in futility, for me - I've been at listening sessions where this is done, and I find it mind-numbingly boring, and irritating; all I can hear is where both setups are 'wrong', and this gets in the way of everything else - resolve the deficiencies of both, and then one can properly get on with the supposed point of the exercise: determining a "preference".

oivavoi · Feb 22, 2017

Jakob1863 said:
Controlled listening tests (I still don't like the label "blind test", as it relates to only one bias mechanism) are just a tool that helps in sorting things out. Independent and dependent variables are defined, and in a perfectly controlled test the result would show the true relationship between them. In reality there is no perfectly controlled test (one of the golden rules says "the more rigorous the control, the less the practical importance of the results"), so some confounders influence the results, making it increasingly difficult to ensure that the independent variable is the only real input.
At that point there is another golden rule: "block out what you can and randomize what you can not block out".

But before any consideration about protocols starts, one has to decide what hypothesis/question should be tested and whether any generalization of the results should be possible. It helps a lot to be as clear as possible in the verbalization of the hypothesis. Imo that point has quite often been underestimated in audio tests (those concerned with overall differences; the psychophysical tests were traditionally concerned with only one parameter).

So, as a tool, controlled listening tests can help in preference decisions, but there is no magic in them, which means one has to ensure that the test itself does not have an impact on the result. That is sometimes a bit difficult to achieve, as it depends on the EUT and the listener, so it might need an extended training time under the specific test conditions.
Especially if overall preference is tested - at that point I beg to differ with Blumlein88's description, as the "is saltier" question is called a directional test and usually tests for a difference, not a preference, while otoh a preference test is quite often non-directional - because preference is based on a multidimensional perception.

Excellent comment!

Jakob1863 · Feb 22, 2017

fas42 said:
Everyone has their own perspective, but in my audio world the word "prefer" is non-existent. A system either functions well enough so there are no obvious, audible problems - or it doesn't. Comparing two audibly flawed systems, or variations of the one system is an exercise in futility, for me - I've been at listening sessions where this is done, and I find it mind-numbingly boring, and irritating; all I can hear is where both setups are 'wrong', and this gets in the way of everything else - resolve the deficiencies of both, and then one can properly get on with the supposed point of the exercise: determining a "preference".

Maybe i got a wrong idea about the question that was raised in the first post of this thread, as i thought it was related to typical consumer choices.
Quite often these choices are not about two (very?) different systems but rather the same system in which just one part is changed.

In the situation that you described it can of course be difficult or even impossible to answer a preference question if you have to choose in the quality range of "awful" vs "unbearable".

But that, at least in my experience, isn´t the usual scenario.

fas42 · Feb 22, 2017

Yes, the prefer is usually about a system with one part changed - what I quibble about are the connotations of "prefer", which I believe is the wrong way to go about achieving higher quality sound. In a different sphere entirely, I fixed something that was broken in our car a couple of days ago - but just yesterday we went for a drive and there's a rattle that wasn't there before. Now, it's obvious why - the fix has altered the alignment of internal fittings, two hard surfaces are bouncing against each other. In the audio world, people say, which do you prefer, the car to have rattles, or not to have rattles? Okay, you don't want the rattle, so ... unfix that broken part! Nooo ... that's not the way ... you realise that there is movement forward, so now you 'fix' the next problem - cushioning the movement between the noise generating surfaces.

Unfortunately, in the audio world that type of logical thinking is almost non-existent ... people play games of "preferring", instead ...

krabapple · Apr 6, 2017

Blumlein 88 said:
Pretty simple really. You claim you can jump 15 feet in the air. I will put up a 15 foot marker or pole and ask you to demonstrate. You either can or you can't.

You claim you can hear 30 kHz, I'll play you a tone and you either can or you can't.

Or if you are Amir, you report you can hear the difference..if you crank the volume of a reverb tail way up. Or find *just the moment* in the piece and listen with utmost attention to it repeatedly until you're sure.

I 'call foul' on that, because that is NOT what audiophiles and golden ears and Bob Stuart's ad copy and pretty much every 'subjectivist' claims. Differences between A and B, in the vernacular of the audio hobby world, are easy to hear for anyone with the 'right equipment' and the 'right ears' (not the 'right degree of volume cranking' or 'if you listen to just the right moment' and god knows it's never 'the right room' despite the huge role the room plays). The differences are akin to veils being lifted, the differences can be heard even by barely interested spouses, the differences produce 'fatigue' versus 'sonic bliss', the differences can verily make tulips grow in your garden.

You claim you can hear a difference between FLAC and wav files. I'll put up FLAC and wav files and you pick which is which. You either can or you can't. No leaps of faith either way. That is the simple version, and there is a long and deep backlog of knowledge about what we can and cannot hear. Developed academically over a hundred years or more. There is additional knowledge of how the hearing mechanism is constructed and it fits with what can be demonstrably heard.

You claim you can hear artefacts of reproduction that happen at -160 dB, you either can or you can't. In addition, since Brownian motion of air molecules is above that level, we can quite reasonably predict that you can't. If you can demonstrate you do, well that will need investigating.

Oh, and from other fields involving other senses and activities we know how bias, and placebo colors actual perception. Hearing isn't something special that can thwart that fact.

Mr. Blumlein here is exactly right to ask what *you* (you, the golden-eared subjectivist reader, whoever you are) claim to hear and how *you* verified it and why *you* think it matters. That can be tested! Using the same A and B gear and samples that they already claim to sound different. Just add one crucial ingredient: blinding. The idea that the only 'valid' DBTs are academic experiments, those where the most sensitive listeners and the most revealing conditions were used, is nonsense in the context of the claims of the hobby. In the context of the hobby, where all sorts of extraordinary hearing abilities are assumed to be real, we can *certainly* see if a given audiophile's claims hold up...by testing that audiophile. And as the (quite typically) negative results from such tests pile up, the sensible response isn't *gee, you should have used a better listener*. The sensible response is *gee, I wonder how many audiophiles are just deluding themselves, and what are the implications for audio marketing, journalism, and hobbyism?*

And btw, a positive result, in either a test of a hobbyist, or an academic trial, doesn't automatically mean that you, the random subjectivist reader, can hear that difference too.

Furthermore, when Amir hears a difference between A and B under abnormal conditions (e.g., cranked reverb tail) it doesn't mean he hears a difference between A and B in normal listening.

Lastly, there is no 'rule' that says an ABX (or other blind protocol *including DBTs for preference rather than difference* which absolutely DO exist - e.g., Olive's loudspeaker tests, not to mention routine use in food testing) must forbid long listening samples, or letting the subject switch samples, or adjusting the levels.
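
For anyone wondering what "you either can or you can't" looks like as a pass mark, here is a minimal Python sketch. It assumes a plain ABX-style run of independent trials where a guesser is right half the time, and a conventional 5% threshold; the function names and the threshold are choices made for this example, not part of any fixed ABX rule.

```python
from math import comb

def chance_tail(k, n):
    """Probability of k or more correct answers in n trials by pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

def min_correct(n, alpha=0.05):
    """Smallest score out of n that guessing reaches with probability <= alpha.

    Returns None when even a perfect score is more likely than alpha,
    which means the run is too short to show anything.
    """
    for k in range(n + 1):
        if chance_tail(k, n) <= alpha:
            return k
    return None

for n in (10, 16, 20):
    print(n, "trials ->", min_correct(n), "correct needed")
```

Under these assumptions the pass marks come out as 9 of 10, 12 of 16 and 15 of 20. Nothing in the calculation depends on sample length, switching, or level adjustment, so relaxing those does not change the arithmetic.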

Jinjuku · Apr 6, 2017

Cosmik said:
If it's all so simple, then why use music as your test signal and not tones, noise, clicks and bleeps? This is the point where the whole "It's real science" thing falls apart. It turns a scientific experiment into a beauty parade and, as we know, beauty exists only in the eye of the beholder. The result can be that a real difference is masked by the 'emotional' content of the music, and that supposed preferences are really just a response to novelty or fashion. You can launder the results into statistics with six decimal places of course.

1. People make claims and claims are indeed testable.
2. People making hyperbolic claims aren't making claims about single band or sweep frequency.
3. Beauty is in the eye of the beholder, audio is in the ear. You don't need to see to hear.
4. People of the ilk of Lavorgna, Darko, Van Es, Scott's are making an emotional claim and they launder this into as many $$'s digits as possible. Their reviews are a response to novelty or fashion, but not about true audible characteristics.

Sauce that is tasty on Goose tastes just as good on Gander.

fas42 · Apr 6, 2017

The general nature of what is improved with a better system, in the subjective sense, is that complexity in the sound is much easier to digest, and follow. Listening "testing" with a single instrument, with ideal circumstances for such a recording, is the worst type of signal to use for evaluation - extremely complex, dense recordings, with different aspects of the sound being recorded in different "spaces" is far superior for doing this job. Better playback allows one to clearly hear all the strands making up the whole as separate entities, quite distinct and fully formed - this is as far from a "wall of sound" pop track experience as one can go; the latter may tick the boxes for some people, but I find this type of sound exceedingly boring, as interesting and worthy as a fizzy soft drink.

If one listens to playback attuned to this aspect of the sound then it becomes easy to perceive the differences - an inferior system swamps you with a barrage of "sound"; a better one allows one to comfortably "dissect what's there", to turn one's attention to some element in the music and follow that in a relaxed, unforced manner.

Cosmik · Apr 7, 2017

Jinjuku said:
1. People make claims and claims are indeed testable.

We are dealing with human consciousness and 'art'.

If someone claimed that a particular audio system could make jokes funnier (no different from claiming that it can convey the "emotion" of music better - I was criticised for saying this was pseudoscientific claptrap), how would we test the claim? People only find jokes funny when they are 'in the mood' - probably not when they know they are being tested; we can only hear a joke once and it will not be funny again, at least until we forget it over time; jokes are like fashion and they change with time and place - they are not universal. And yet people do, indeed, laugh at jokes replayed over audio systems when they are caught 'off guard'.

You cannot show that any listening test involving music (rather than clicks or bleeps - but even these may have some aesthetic content!) is not equally meaningless - it is just not so obvious.

For those of a philosophical bent, it is obvious that using the 'scientific method' does not automatically make what you are doing 'science'.

watchnerd · Apr 7, 2017

fas42 said:
The general nature of what is improved with a better system, in the subjective sense, is that complexity in the sound is much easier to digest, and follow. Listening "testing" with a single instrument, with ideal circumstances for such a recording, is the worst type of signal to use for evaluation - extremely complex, dense recordings, with different aspects of the sound being recorded in different "spaces" is far superior for doing this job. Better playback allows one to clearly hear all the strands making up the whole as separate entities, quite distinct and fully formed - this is as far from a "wall of sound" pop track experience as one can go; the latter may tick the boxes for some people, but I find this type of sound exceedingly boring, as interesting and worthy as a fizzy soft drink.

If one listens to playback attuned to this aspect of the sound then it becomes easy to perceive the differences - an inferior system swamps you with a barrage of "sound"; a better one allows one to comfortably "dissect what's there", to turn one's attention to some element in the music and follow that in a relaxed, unforced manner.

Well, sure -- a nice 3-way will usually do better than a nice single-driver when it comes to complex material.

But this isn't a limit of blind testing; it's pretty easy to spot in listening tests, blind or otherwise. It also shows up in distortion measurements, waterfall graphs, cone break-up measurements, etc.

watchnerd · Apr 7, 2017

Cosmik said:
(no different from claiming that it can convey the "emotion" of music better - I was criticised for saying this was pseudoscientific claptrap)

100% agree. The best audio system will not magically give high resolution Taylor Swift the emotional depth of low-fi mono Billie Holiday or Maria Callas.

fas42 · Apr 7, 2017

watchnerd said:
Well, sure -- a nice 3-way will usually do better than a nice single-driver when it comes to complex material.

But this isn't a limit of blind testing; it's pretty easy to spot in listening tests, blind or otherwise. It also shows up in distortion measurements, waterfall graphs, cone break-up measurements, etc.

Yes, the "pretty easy to spot, blind or otherwise" is what it's about - I use complex music to "expose the weaknesses" - so it's not "emotional depth" I'm chasing, it's being able to hear what's going on; Taylor Swift material may be trite, but if the background musical filler has interesting things happening in it then it becomes worthwhile as a listening experience.

amirm · Apr 7, 2017

Cosmik said:
If someone claimed that a particular audio system could make jokes funnier (no different from claiming that it can convey the "emotion" of music better - I was criticised for saying this was pseudoscientific claptrap), how would we test the claim?

Pretty easy. Play it for me and if I laugh they are right. Otherwise, we feed them to the wolves. End of discussion.

Jinjuku · Apr 7, 2017

Cosmik said:
We are dealing with human consciousness and 'art'.

Nope. We are evaluating claims.

William Low stated, at WBF, that their Ethernet cables make a pronounced difference at all sorts of venues, all the time, all over the world.

This isn't an artistic claim. This isn't an emotional claim. It's a performance claim. One that I offered to show up with some cabling, a layer 3 managed switch and their setup, and let him demonstrate.

Of course AQ has no upside in sitting for a test where, all the time, all over the world, in all sorts of settings, their RJE's make a 'readily apparent', 'easy to discern' difference.

Sal1950 · Apr 7, 2017

Sniff sniff, sniff sniff ???
OH NO, I can smell it. Someone let Frank out of the facility again.

Sal1950 · Apr 7, 2017

Jinjuku said:
William Low stated, at WBF, that their Ethernet cables make a pronounced difference at all sorts of venues, all the time, all over the world.

He also listened/watched the rigged AQ HDMI cable video and knew something wasn't kosher, but never pursued it till the objective community called them out on it. Then at that time it disappeared and he claimed no responsibility.

WHO ME???

watchnerd · Apr 7, 2017

fas42 said:
Taylor Swift material may be trite, but if the background musical filler has interesting things happening in it then it becomes worthwhile as a listening experience.

I think your definition of 'worthwhile listening' and mine are very different.

Some people think Yngwie Malmsteen is the world's greatest guitarist because he can shred so fast, while I prefer the understated lyricism of Mark Knopfler.

fas42 · Apr 7, 2017

watchnerd said:
I think your definition of 'worthwhile listening' and mine are very different.

Some people think Yngwie Malmsteen is the world's greatest guitarist because he can shred so fast, while I prefer the understated lyricism of Mark Knopfler.

Everyone's different! ... But I agree about Mark Knopfler.

I've said on a number of occasions that what I enjoy about audio is the texture of the sound, in the air - appreciation of an individual's playing prowess can be done on a car radio; but the interweaving of all the sound strands on a track can usually only be properly grokked when the playback is of a very high standard ... and a lot of the pop "thrash" around at the moment has quite a bit of clever mixing of sound styles and patterns - the latter is something I can appreciate ...

Cosmik · Apr 7, 2017

Jinjuku said:
Nope. We are evaluating claims.

William Low stated, at WBF, that their Ethernet cables make a pronounced difference at all sorts of venues, all the time, all over the world.

This isn't an artistic claim. This isn't an emotional claim. It's a performance claim. One that I offered to show up with some cabling, a layer 3 managed switch and their setup, and let him demonstrate.

Of course AQ has no upside in sitting for a test where, all the time, all over the world, in all sorts of settings, their RJE's make a 'readily apparent', 'easy to discern' difference.

We may be thinking of different reasons for using, or not using, listening tests. I am wondering how we can 'cut to the chase' of a better audio system and concluding that listening tests are so meaningless that they just confuse matters. You may be thinking of using them to 'fact-check' certain individual (deluded?) claims. Personally, I would evaluate digital systems by (a) just thinking about how they work, and (b) entirely objective measurements.

Jinjuku · Apr 7, 2017

Cosmik said:
We may be thinking of different reasons for using, or not using, listening tests. I am wondering how we can 'cut to the chase' of a better audio system and concluding that listening tests are so meaningless that they just confuse matters. You may be thinking of using them to 'fact-check' certain individual (deluded?) claims. Personally, I would evaluate digital systems by (a) just thinking about how they work, and (b) entirely objective measurements.

We aren't using any listening tests different from what the sighted subjectivists are already doing. At the Polk Audio forums they were going on and on about burned-in vs non-burned-in cabling. I offered to send out two sets of cables, randomly labeled, one set burned in. The offer was 30 days of fully sighted, self-administered listening, with 100% control of listening length and how quickly to swap out. So no test administrator? Check. Fully sighted? Check. On their own equipment? Check. In their own room? Check. Using their own music? Check.
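
The "randomly labeled" part is the only piece that needs any machinery, and it is tiny. Here is a minimal Python sketch of how such labels could be assigned. The function name and the label scheme are assumptions for illustration, not a description of how the cables in that offer were actually marked.

```python
import random
from string import ascii_uppercase

def assign_blind_labels(items, seed=None):
    """Map neutral labels ('A', 'B', ...) to items in a random order.

    The returned dict is the key that the organizer keeps private until
    the listener has reported which label they believe is which.
    """
    if len(items) > len(ascii_uppercase):
        raise ValueError("more items than available single-letter labels")
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return dict(zip(ascii_uppercase, shuffled))

# Hypothetical example with the two cable sets from the offer above.
key = assign_blind_labels(["burned-in", "not burned-in"])
print(sorted(key))  # the listener only ever sees the labels
```

With only two sets, a single correct identification happens by luck half the time, so one listener getting it right would not settle much on its own.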

Not a single member there could polk a hole in the testing methodology. One said they would participate and then the others browbeat them into backing out.

I've also offered Bill Low and Michael Lavorgna a controlled session to evaluate Ethernet cabling with teamed network adapters and a LAG on a layer 3 switch that would allow realtime change of cable without break in audio playback.

Take a guess how many holes they could polk in that setup? Guess how many said they were willing to participate? Hint: it's the same number.

Lastly I ADC'd some tracks where I swapped out a $330 3 meter cable ($27.50 foot) and $90 98 meter cable ($0.30 foot) and posted those tracks. Three people actually tried the tracks and in this case no one correctly guessed the # of swaps, the time of swaps, or what cable was being used at what point.
