The frailty of Sighted Listening Tests

Deleted member 17820 · Aug 13, 2020

solderdude said:
Have you ever heard 2 speakers (or 2 headphones for that matter) sound exactly the same ?
Why would it be impossible to tell these apart ?
I don't get it... even with the same 'preference' score (a single number) one could have certain properties that stand out and when you hear another model, with the same rating, could not have or has a different 'thing' in its character.

Very similar does not mean the same. What's so hard about that. Amir gave his opinion. I assume the best in people so reckon that was what he found.
The K361 and K371 measure quite similar but they certainly don't sound the same and once you know what specifics of a headphone are you can hear it when they are missing.

Its not like differences between amps or DAC's we are talking about speakers and human evaluation etc.

Every Bose I've heard sounds the same!

Sean Olive · Aug 13, 2020

amirm said:
Let's also remember that there is a pandemic out there so if there ever was the wrong time to bring people to listen to speakers, it is now. I am pretty sure having me alone rate speakers blind would be dismissed out of hand as it has been.

Yes, the pandemic has been a real impediment to doing any kind of listening tests, which makes objective measurements that are predictive of sound quality even more important.

Sean Olive · Aug 13, 2020

Rusty Shackleford said:
I will let Mr. Olive respond if he wants, but what Amir said earlier in this thread was something quite different.

Amir claimed that Olive’s study was not up to “exacting standards” and was merely about “solving an internal problem [at Harman] where people did not want to believe in controlled testing,” meaning that it couldn’t tell us anything about trained versus untrained listeners’ abilities in sighted versus blind listening.

It’s axiomatic that if the study were done with non-Harman employees, any company loyalty would be removed. But as Mr. Olive said in his post above, that is not the only bias.

It’s also axiomatic that a single study cannot be generalized to all sighted listening. That’s different from saying, as Amir did, that the study’s only purpose was to settle an internal dispute in Harman about the value of blind listening.

No one has claimed that Olive’s study can “prove what non-Harman employees would have done.” The crux of the debate was whether Amir’s training obviated blind listening’s superiority in evaluating speakers. The study was cited to demonstrate that blind listening is still superior to sighted listening, even with trained listeners.

I do think that Racheski is correct that Amir’s position seems to have been that the study didn’t apply to him, because he has greater listening skills and ability to shake biases than Harman’s trained listeners. I’ll leave it to others do decide if they are persuaded by that. I am not.

One way to indirectly compare blind vs. sighted tests is to look compare ratings of products rated internally at Harman using controlled tests versus reviews done by magazines/ reviewers which are 99.9% sighted. Of course, there are many flaws in such comparisons because the test conditions and listener's expertise and hearing are widely different from those in our tests.

Still, it is an interesting exercise to see if trained/untrained listeners in controlled tests come up with the same conclusions as professional audio reviewers. Recently I looked at magazines that review headphones and compared their ratings of the same headphone we had tested in our labs using controlled tests. In this case, I used our predicted ratings of headphone. The predicted preference rating is based on a model we developed using actual listener data from blind tests.

The correlation is generally quite poor ( r = 0.5) but ranged from 0.75 (Consumer reports) to 0.17 for PC magazine. Rtings is the only site that actually uses predicted ratings based on measurements -- they don't do listening tests. The other magazines use 1-2 listeners to come up with the subjective scores.

It gives you a sense of how variable scores can be of products when the listening is done in an uncontrolled way. And if you rely on one of these magazine sources for choosing a good sounding headphone, you are rolling the dice.

Something to think about. This data was reported in this AES paper: https://www.aes.org/e-lib/browse.cfm?elib=19774

valerianf · Aug 13, 2020

Well I can only bring my own experience of speaker testing in shop listening rooms.
I once went to a HiFi shop to get some information about speakers.
Fortunately the owner had some Monitor Audio speaker towers ready to listen at.
The room was not an audio listening room at all.
The amplifier was unknown to me as the HiRes audio source.
But it was one of the best audio listening experience in my life: I sat there 1 hour and listen carefully to the amazing sound stage.

My advice for choosing a speaker is simple: bring with you a hi-res soundtrack that you know very well.
Try a speaker pair where ever it is possible.
Listen carefully a long time, at least 20 minutes.
If you are not impressed, go away and try another speaker.
If you are impressed, go away and come back another day for a second listening session that will confirm/contradict the first one.

The only issue is that nowadays Internet buying is the trend, not allowing anymore this listening selection.
Hopefully Amir is providing us with his listening judgment.
This is an important part of the test for me.

Sean Olive · Aug 13, 2020

patate91 said:
If you had to revisit your blog's article about sighted evaluation (the one that I shared), would you change, add or remove something?

I think my opinion stated here is much the same opinion I held when I wrote the blog posting. I do think sighted tests have their role in audio, and some are more useful than others when the right controls are in place.

A trained listener can provide useful data about the spectral/spatial/dynamic/distortion attributes of the product that an untrained listener would have difficulty providing. The more standardized the feedback is so we understand it in the proper context, the more useful it is. Reading poetic prose about the pace, rhythm and musicality of the speaker is utterly useless to me.

valerianf · Aug 13, 2020

"A trained listener can provide useful data about the spectral/spatial/dynamic/distortion attributes of the product ".
I fully agree with this sentence, It could be a professional musician or sound engineer.
No need for blind listening.
But there is something else more important: the company that is hiring the trained listener needs to be free of any capital/advertising link to the speaker manufacturer.
That is where ASR is a good example.

solderdude · Aug 13, 2020

SoonHappy said:
Every Bose I've heard sounds the same!

solderdude · Aug 13, 2020

Sean Olive said:
Rtings is the only site that actually uses predicted ratings based on measurements -- they don't do listening tests.

Sam Vafei told me they do listen to the headphones to kind of check if the measurements validate what is heard and that is shortly described/used in the review, but indeed not taken into account on the number generating part.
The preference ratings are very close together and, as expected, not shared by audiophiles that put an emphasis at certain aspects and care less for other aspects. These folks thus rate a headphone different in the same usecase because of their own weighting/preferences.
A big plus for Rtings is that they rate a headphone for several use cases by using different weighting of the measured aspects. They did so after several complaints (which were valid) and are still refining the scoring/weighting.

MattHooper · Aug 13, 2020

Speaking of the Frailty Of Sighted Listening, you have members over on audiogon discussing the sonic differences between the screws used in their speakers. Replacing the brass screws with copper screws in Tekton speakers renders "Notable increase in transparency, more coherency."

Yeesh.

Blumlein 88 · Aug 13, 2020

MattHooper said:
Speaking of the Frailty Of Sighted Listening, you have members over on audiogon discussing the sonic differences between the screws used in their speakers. Replacing the brass screws with copper screws in Tekton speakers renders "Notable increase in transparency, more coherency."

Yeesh.

Is that just copper screws or audiophile OFC grade copper screws. I would think the latter go beyond coherency of sound to a holistic musical experience.

I personally prefer Russian military grade aged titanium from cold war era scrapped fighter plane air frames. Much better internal damping than copper.

Kal Rubinson · Aug 13, 2020

solderdude said:
Have you ever heard 2 speakers (or 2 headphones for that matter) sound exactly the same ?
Why would it be impossible to tell these apart ?

Have you done this blind?

Sean Olive · Aug 13, 2020

solderdude said:
Sam Vafei told me they do listen to the headphones to kind of check if the measurements validate what is heard and that is shortly described/used in the review, but indeed not taken into account on the number generating part.
The preference ratings are very close together and, as expected, not shared by audiophiles that put an emphasis at certain aspects and care less for other aspects. These folks thus rate a headphone different in the same usecase because of their own weighting/preferences.
A big plus for Rtings is that they rate a headphone for several use cases by using different weighting of the measured aspects. They did so after several complaints (which were valid) and are still refining the scoring/weighting.

I'm not knocking Rtings. I appreciate the valuable work they do. In the past few months, I believe they have adopted the Harman Target Curves for both IE and AE/OE headphones when they calculate their scores. although above 1-2 kHz it varies to compensate for the different test fixture they use (Head Acoustics vs. GRAS). They also calculate the SQ rating different than we do, giving some weight to other factors. So the scores in my graph may not reflect the changes they've made.

whazzup · Aug 13, 2020

Rusty Shackleford said:
I don’t think anyone doubts that someone who meets Olive’s definition of a trained listener is, on average, likely to be a much more reliable listener than an untrained listener. The How to Listen training is also designed to make sure those trained listeners are also more accurate (valid) in their observations

Nor, I think, is there any debate that the differences between a poor measuring speaker and a well-measuring speaker would be obvious to even many untrained listeners when heard side-by-side.

The issue at hand, IMHO, was whether sighted listening of two very similar-measuring speakers weeks apart — rather than side-by-side, level-matched, etc. — was likely to elicit meaningful differentiation, even by trained listener.

There was the further question of what the definition of a trained listener is, which Olive answered. There also was the simple observation that other reviewers, whom people on this board may not like, might meet the definition of a trained listener, too. (This was just illustrated in another thread, where a Harman dealer provided subjective comparisons of several JBL speakers. These were dismissed out of hand by two ASR members. But perhaps they shouldn’t be, depending the dealer’s How to Listen score.)

By the'weeks apart' speakers, are you referring to specific reviews? I do remember Amir stating he hooks up the revels to compare, but probably not in every review.

Olive did also mention that those who trained to be a listener but couldn't give consistent evaluations ultimately took on more marketing roles. But definitely, if the dealer has a'score', it's an additional data point for those considering his wares.

solderdude · Aug 13, 2020

Kal Rubinson said:
Have you done this blind?

Hard to do with headphones

Well maybe if I inject my head with the same stuff the dentist uses.

solderdude · Aug 13, 2020

Sean Olive said:
I'm not knocking Rtings. I appreciate the valuable work they do. In the past few months, I believe they have adopted the Harman Target Curves for both IE and AE/OE headphones when they calculate their scores. although above 1-2 kHz it varies to compensate for the different test fixture they use (Head Acoustics vs. GRAS). They also calculate the SQ rating different than we do, giving some weight to other factors. So the scores in my graph may not reflect the changes they've made.

Yes, they did replace their own correction with Harman target or at least moved closer toward it. I know you are not knocking ratings. Merely mentioned they do listen but do not 'manipulate' their rating on it.

Kal Rubinson · Aug 13, 2020

solderdude said:
Hard to do with headphones

Obviously, I was referring to speakers. (I tend to be blind to headphones universally.)

solderdude · Aug 13, 2020

I worked in a hifi store (long ago) and built a relay controlled switch box so you could compare 3 amps and about 10 speakers in the same room.
Of course I did not realize it was kind of important to short the not connected speakers them being close together. Also placement differed obviously.
Naturally played around with the switcher. It did not adjust the volume per speaker either.
The results will be no surprise to you.

Blumlein 88 · Aug 13, 2020

Kal Rubinson said:
Have you done this blind?

Like this:

bobbooo · Aug 14, 2020

I think there’s been some repeated misunderstandings / misrepresentations of each side of the debate by the other on here. These are the two opposing hypotheses/claims under debate:

Experienced/trained listeners are no less susceptible to sighted bias than average
Experienced/trained listeners are less susceptible to sighted bias than average

Drs Toole and Olive’s study has been cited in support of claim 1. Below are the results from the paper that can be used to compare experienced listeners' preference ratings to the average for experienced and inexperienced listeners as a whole, for blind and sighted tests. Note: the speaker ratings are likely naturally compressed due to listeners' contraction bias (not using extreme ends of the scale), which is common in subjective evaluations. Rescaling the rating axis from 4/5 to 8/9 is simply done to make the data more readable, and visually correct for this contraction bias, so there's no conspiracy there.

Average for experienced and inexperienced listeners (same data as the first graph in Sean Olive’s blog often reproduced on here, but in a different format):

Experienced listeners (the more pertinent graph to this discussion, which I don't think has been discussed yet):

So for experienced and inexperienced listeners as a whole shown in the first graph, on average only the preference order of speakers S and T changed places between sighted and blind listening, but for experienced listeners only, the preference order completely changed - when listened sighted it was: D, G, T, S, whereas the preference order during blind listening was: S, D, T, G. The difference in score between the blind and sighted ratings given for the same speaker by the experienced listeners is also larger on average than this difference for all listeners. This suggests the experienced listeners were at least equally (if not more) affected by sighted bias than the average of all types of listeners.

The study also compared how sensitive the listeners were to changing acoustic variables in sighted and blind listening, in this case two speaker positions, 1 and 2.

Average for experienced and inexperienced listeners:

Experienced listeners:

Both graphs show speaker location had a strong influence on preference when blind, yet little effect when sighted, again showing that experienced listeners are just as affected by sighted bias as all listeners are, which in this case deafens ('blinds'

) them to actual acoustic changes caused by speaker positioning, which they recognised fine during actual blind listening. All these results support claim 1 above.

Now, from what I can tell, the two main objections to the study seem to be:

(a) The listeners’ bias is not representative of and much greater than Amir’s possible bias, due to them being Harman employees and three of the four speakers being Harman brands
(b) The study's definition of experienced/trained listener is too inclusive

Starting with objection (a), I think @preload made some great points here. Simply investing a large amount of money in, owning and very much liking a brand’s products and design philosophy can in itself foster a subconscious brand loyalty and so cognitive bias. Sure this would likely not be as much as the bias the Harman employees had for their own speakers, but there are all the other possible biases @Sean Olive mentioned that are still on the table and common possibilities to all sighted listening tests. Even if objection (a) is valid, and an extreme position is maintained that the only valid results are for the non-Harman speaker ‘T’, the last graph above showing experienced listener ratings does show a significant change in rating given for this speaker in ‘position 2’ between sighted and blind, shifting it from being ranked third sighted, to last blind, which notably runs counter to any possible bias against speaker T due to it being a rival brand, suggesting the remaining biases, that are common to all sighted listening tests, play a relatively large role. The graph above for all listeners shows the same shift in ranking of speaker T in position 2 from sighted to blind, and a similar change in rating (again less than for experienced listeners), echoing the results from the first two graphs of this post, again showing the experienced listeners were at least equally if not more affected by sighted bias, even when listening to a speaker they had no vested interest in.

So what about objection (b)? Here's how experienced/inexperienced listeners are defined in the study (my emphasis):

In these tests, listeners were considered to be inexperienced if they had no previous experience in controlled listening tests. Other definitions are possible, which might include persons with no critical listening experience whatsoever.

The bolded parts imply an experienced listener is one who has had at least critical listening experience and controlled listening test experience. This doesn't sound too inclusive to me. And even if it is, and doesn't meet the requirements of a 'highly experienced/trained' listener (whatever they are), it makes sense that this experience is a continuum of ability, which would mean at worst the study is suggestive evidence that even highly experienced listeners are also no less susceptible to sighted bias than others (claim 1). What scientific research is there in evidence of the opposing claim 2 at the beginning of this post, that experienced listeners are less susceptible to sighted bias? If there is none, then claim 1 is on stronger ground. If you take the extreme (and I'd say irrational) view that this study contains zero evidence for claim 1, then the two claims are on equal footing, and you should remain agnostic. The fact remains that claim 2 is a claim of exception however, that goes against not only this study, but cognitive science as well - I'm not aware of any scientific studies showing sighted biases can be noticeably reduced through knowledge of them and training. In fact, this would be a prime example of the (ridiculously named, but very real) G.I. Joe fallacy. When it comes to cognitive biases, knowing really isn't half the battle - in fact it's not even close:

It should be noted that as Sean said here, Harman now have a more exacting definition of a trained listener – passing level 8 or higher in their How to Listen software, with normal audiometric hearing, and showing good discrimination and consistency in their sound ratings. I believe Amir has said he reached level 5/6 (still much better than audio dealers who only passed level 3), and I presume ‘normal’ hearing precludes people with notable presbycusis that can start to become significant in terms of sound judgement variability after around age 50 (as Floyd Toole has humbly described with reference to his own hearing and I mentioned in this post). 'Normal hearing' would obviously also preclude those with notable NIHL which could occur due to such activities as, ahem, routinely listening to headphones at ‘earlobe resonating' volumes

. Of course, Amir has specific training in identifying small lossy digital compression artefacts (I believe primarily via IEMs/headphones, speakers being notoriously harder to hear sound imperfections with), but the relevance of this specific skill to discerning differences in speakers’ acoustic attributes at normal listening volumes and distances, and to what extent if any this skill could balance out the high stipulations for a Harman trained listener above is debateable.

But the bigger picture here is that sighted bias is just the tip of the iceberg in terms of the nuisance variables needed to be controlled for listening tests to be useful in drawing reliable conclusions. Some of these have been controlled for here, but there are major exceptions in addition to standard sighted bias: measurement bias (from seeing the spinorama before listening), no level-matching, and no instantaneous A/B switching (instead mostly comparing speakers over days, weeks and months, relying on long-term auditory memory which is notoriously unreliable). And this isn’t even considering the fact that this is a single listener whose perceptions are not as generalizable as a collection of listeners, or any of the other methodological controls put in place in a scientifically controlled double-blind study Sean mentioned here. The gulf between those studies and the listening tests here really is huge.

Please note: this post is in no way either an attack on Amir, or a demand (or even a request) to change his listening methodology (this would obviously be impractical for one person and especially during a pandemic, and he's doing all of this for free so I would never demand anything). I don’t think anyone else is taking these positions either, and of course we are all incredibly grateful for the frankly mind-boggling amount of work he’s put in to this project. However, it has been claimed that the subjective impressions are ‘data’, from which conclusions can be drawn about the accuracy and validity of Sean Olive’s speaker preference rating formula. If this is the case, this necessitates the same analysis and scrutiny of the ‘measuring instrument’ and method of data collection as has been exacted on the Klippel NFS data. If this is objected to or ignored, then it's simply inconsistent and unscientific to maintain the subjective judgements are data, and not informal impressions (which is what they seemed to start out as, and personally I was fine with). I am also not saying the impressions have zero utility either - they can definitely point in interesting directions for fully controlled listening tests to investigate further. But any claims made by anyone that conclusions can be drawn about the validity of the preference formula from these impressions are not really tenable, as partially-controlled, sighted, single-listener tests are simply incongruent with the well-controlled, double-blind tests by hundreds of listeners the formula is based on.

valerianf · Aug 14, 2020

Well cars are tested and rated by professionals without any way of doing a blind test!
What is important:
1) The tester abilities.
2) The tester equipment.
3) The tester independence.
That is all what is required.

The frailty of Sighted Listening Tests

Deleted member 17820

Guest

Senior Member

Senior Member

Addicted to Fun and Learning

Senior Member

Addicted to Fun and Learning

Grand Contributor

Grand Contributor

Master Contributor

Grand Contributor

Master Contributor

Senior Member

Addicted to Fun and Learning

Grand Contributor

Grand Contributor

Master Contributor

Grand Contributor

Grand Contributor

Major Contributor

Addicted to Fun and Learning

Similar threads