The frailty of Sighted Listening Tests

patate91 · Aug 15, 2020

whazzup said:
Not really, you can start another discussion on Erin's thread telling him how he should only do blind testing because it's more reliable. Please do not reserve your special attention for Amir only.

Erin's seems to see an humble guy, the way he's interacting on the forum shows that he would certainly questions and critique differently. But sure If something looks "suspicious" I'll ask him.

Again Erin presente the subjective part diffrently. The overall présentation looks more neutral.

whazzup · Aug 15, 2020

Coach_Kaarlo said:
My responses in bold.

The point of Question 1 is simply whether experienced listener / critical listener is one and the same. To me it is not, to you it is?]

Yes and no. Define the criteria for same perhaps.
Unfortunately I have no clue to the actual credentials of the study participants. I've only glanced at sections of the study and an impression that they're not actually trained and tasked to vet speakers, so it's definitely possible that I'm wrong. Olive however has also remarked that it's incorrect to generalize the study results to all sighted / blind testing situations and trained/untrained listeners.

{Yes, let's assume bias is always present. And blind testing trumps sighted testing in accuracy, no one doubts that too.
Under the 2 assumptions above, can a 'critical listener' still dispense his work duties *(with sighted testing, just so it's clear. Blind testing is used when required)?]

Yes, which was my poorly explained point - seeing the speaker does not diminish my ability to tell it is out of phase v in-phase. However, trying to describe esoteric or imaginative characteristics which elude measurement and cannot be discovered / replicated by other similar skilled listeners is the line. And this line would be independent of the listeners credentials.
I do agree with you on this.

[Corporations think they can, hence they're still being trained, paid and relied upon.
But in your opinion, they (or their professional opinions when doing sighted testing) cannot be depended upon because everyone has bias? So they HAVE to do blind testing for their professional opinions to have any weight at all?
*edited to hopefully make the question clearer]

Yes, and no. Depends what their claims are, what they think they are hearing. Can they support those claims with facts which can be discovered or replicated by others? Or is it all in their head (cognitive and other biases).
Given the circumstances of critical listeners being in actual jobs that requires reliable evaluation (imagine a doctor wrong 50% of the time ), and from my personal experiences of jobs, that's where I would reason that they have to be able to suppress bias and make evaluations that CAN be further validated by their peers or superiors. And no, I'm not saying they can hear magical audio qualities that's not validated by measurements.

[No arguments on that. It is up to the individual to read the data and listen to opinions and eventually make his own decisions.
So Amir gives a professional opinion, it's up to the individual to listen, or not. No arguments too I hope?]

Depends on Amrim’s claims - see above.
He has made very specific claims, AND he has provided objective evidence many times. AND he doesn't claim to be infallible. Nor does he claim to hear magical things. In your book, if that's not sufficient, then what is? To me, at this point and what he has done here, the onus is on others to disprove what he said and bring the evidence, not him.

whazzup · Aug 15, 2020

patate91 said:
Erin's seems to see an humble guy, the way he's interacting on the forum shows that he would certainly questions and critique differently. But sure If something looks "suspicious" I'll ask him.

Again Erin presente the subjective part diffrently. The overall présentation looks more neutral.

He is NOT doing blind testing. His review is hence NOT reliable. That has been the position you took with Amir. So now Erin is 'different'? Seriously you're a joke. I'll stop here, no point for me to go any further.

patate91 · Aug 15, 2020

whazzup said:
He is NOT doing blind testing. His review is hence NOT reliable. That has been the position you took with Amir. So now Erin is 'different'? Seriously you're a joke. I'll stop here, no point for me to go any further.

Where do I said that I put weight to his subjective impression? Where do you saw that Erin says his subjective impression are really important? I said that he put efforts to be more neutral and that the subjective impressions are not put forward. To get his personnal opinion you have to dig down the review, he details his protocol each time he publish something, listen before looking at data. Again there's a will to put data above everything else. Sure he's not doing blind test, and I'm pretty sure that le I tell him his not doing blind and it is not reliable he'll eau something : Maybe your right, take it with a grain of salt. He'll make a joke, and maybe share something about it. Again le there's something suspect I'll ask, there be no reason too not ask.

preload · Aug 15, 2020

aarons915 said:
I think most agree with #2 and sighted impressions can be valuable but as far as #1, the Harman training makes no claim that it reduces sighted bias, the study they conducted shows that bias affected the experienced listeners more than the inexperienced listeners. Finally, Harman to this day still conducts blind testing with their trained listeners, there would be no need to do that if they felt the trained listeners wouldn't be biased in any way.

Dr. Olive's posts made all of this pretty clear in my opinion so since blind testing isn't feasible at this time, the main takeaway of the thread is that the CTA-2034 measurements are king followed by the subjective impressions.

As a point of clarity, Sean Olive specifically mentioned "meaningful objective measurements that predict [the results of well-controlled, double-blind listening tests]." I interpret this to refer to the application of his published regression formula, which involves mathematical analysis, to a subset of the CEA-2034 "spin" charts. The accuracy of the formula has been quantified. However, to date, I have not seen evidence that simply "eyeballing" a series of raw CEA-2034 measurements can reliably predict loudspeaker sound quality, let alone exceed the predictive value of unblinded listening tests in cases where bias can be reasonably controlled (such as those presented by Amir). (Notice I said "reasonably controlled," not eliminated.)

Blumlein 88 · Aug 15, 2020

preload said:
As a point of clarity, Sean Olive specifically mentioned "meaningful objective measurements that predict [the results of well-controlled, double-blind listening tests]." I interpret this to refer to the application of his published regression formula, which involves mathematical analysis, to a subset of the CEA-2034 "spin" charts. The accuracy of the formula has been quantified. However, to date, I have not seen evidence that simply "eyeballing" a series of raw CEA-2034 measurements can reliably predict loudspeaker sound quality, let alone exceed the predictive value of unblinded listening tests in cases where bias can be reasonably controlled (such as those presented by Amir). (Notice I said "reasonably controlled," not eliminated.)

Well I'd say eyeballing the CEA-2034 spin charts usually is reliable when the results are very poor of it being a poorly received speaker. Now when eyeballing good results and you can not tell it has been usually (not always) from other factors not in that formula. Like inadequate power handling, or inadequate power handling just in the bass. Or because some range has excessively high distortion and it is heard (usually around the crossover points).

So even at this juncture it looks like some refinements of filtering out poor speakers could be done.
Poor spin results...........out.
Excessive distortion..........out.
Inadequate power handling...........out. Power handling could be measured though currently Amir detects it via his listening tests.

Everything passing these hurdles might sound good or still may have issues, but you've narrowed it down usefully by then.

amirm · Aug 15, 2020

bobbooo said:
But any claims made by anyone that conclusions can be drawn about the validity of the preference formula from these impressions are not really tenable, as partially-controlled, sighted, single-listener tests are simply incongruent with the well-controlled, double-blind tests by hundreds of listeners the formula is based on.

There is this constant attempt to exaggerate the current situation and then complain about it. No one is drawing any conclusions about anything from my subjective listening tests. They are an independent data point with their positives and negatives.

On the positive front, I check for such things as how loud a speaker plays before bottoming out. Nothing in the preference score tells you this. And you have no case to make that i must listen blind to determine when a woofer is crackling or clearly bottoming out.

I can also make judgement about the design of a speaker. For example, lack of directivity control during the crossover region. Again you need no blind test to determine this.

I take these and other issues into consideration as I make up my mind to give a Yes/No recommendation to a speaker. That's all. No science is being made. No research is disputed. I am simply sharing my opinion as the only person who has the speaker at hand, has tons of others to compare it to, and experience of having done that nearly 80 times in six months.

Don't expand the borders of this and then complain that it is too broad. I have not stated a single conclusion across all of these listening tests and measurements.

It is preposterous to take the position that no one can ever, without a formal study, tell you a speaker that they just measured and listened to, cannot be followed with "sure, I would by it," or "no, I don't think I will."

Maybe you all live in some fantasy land where only forward progress is made with controlled studies and else, you do nothing. I am not that type. I am engineer first and foremost and make progress with tools and data I have in hand. I have to deliver value to membership and one of those key values is whether I personally think, with all the data I have, the product is worthwhile or not.

Please keep mind that there is not one argument or citation you put forward that I don't know and have not read or practiced. The fact that I am doing what I am doing means that I consider it proper and defensible. So constantly telling me about the same things doesn't move the dial forward at all.

Now, you could develop data I don't have. Take all a few of the speakers I have evaluated and perform a double blind controlled test and show that preference is different than what I stated. Be sure to use trained and critical listeners. Then we have something new. Until then, arguing is not data.

patate91 · Aug 15, 2020

bobbooo said:

Maybe I'm not so dumb

Duke · Aug 15, 2020

amirm said:
Maybe you all live in some fantasy land where only forward progress is made with controlled studies and else, you do nothing. I am not that type. I am engineer first and foremost and make progress with tools and data I have in hand.

Well said. Engineers plow with the horses they've got, because otherwise nothing gets done. And it's the engineers, not their critics, who are making actual contributions.

snackiac · Aug 15, 2020

amirm said:
There is this constant attempt to exaggerate the current situation and then complain about it. No one is drawing any conclusions about anything from my subjective listening tests. They are an independent data point with their positives and negatives.

This is the part of this thread I don't get. It's up the reader to decide how much weight to give the subjective data point. You've stated your credentials, we have a large database of your subjective reviews to correlate with Olive score and spins, and you receive independent review samples. It's SO MUCH MORE than we usually get to ascertain someone's subjective value.

So some have decided to give it less weight, good for them, but now can't stop talking about it as if they can't handle that others are still giving it credence.

preload · Aug 15, 2020

Blumlein 88 said:
Well I'd say eyeballing the CEA-2034 spin charts usually is reliable when the results are very poor of it being a poorly received speaker.

Sure, I agree. Similarly, a sighted inspection of a loudspeaker may also quickly reveal that it will sound poor or be poorly received - examples might be observing that it consists of a single 3" driver in a plastic cube a la Bose or that somebody poked in the tweeter. But is the ability to "rule out" lousy speakers really the purpose of the spin?

Now when eyeballing good results and you can not tell it has been usually (not always) from other factors not in that formula. Like inadequate power handling, or inadequate power handling just in the bass. Or because some range has excessively high distortion and it is heard (usually around the crossover points).

I would add that "eyeballing" the spin charts doesn't mathematically weight or even consider much of the data being presented - we are human and not computers. For instance, can you eyeball a FR chart, calculate a regression line through it, and determine whether the slope of that line is tilted down in a way that will make the speaker sound dull or bright? And can you perform this task accurately, to a fraction of a degree, across the multiple FR charts available in the spin, knowing that the ideal slope differs depending on which curve you're evaluating? Or perhaps you don't consider the slope (Olive's regression doesn't either, for reasons I think I know.) And that brings me to another point, and that there does not appear to be a standard, agreed-upon, method of evaluating spins to begin with, including which curves deserve more weight and what aspects of them (smoothness, slope, directivity, etc.) matter more.

Which is why I say that Olive's regression equation (based on mathematical analysis of a subset of spin charts) can predict blinded preferences, and its accuracy has been quantified across a representative sample of 70 loudspeakers (and likely way more to date). Whereas when it comes to eyeballing a complete set of spin charts - who knows. And I'd even suggest that sighted listening evaluations performed by a motivated and experienced listener with reasonably low exposure to bias can probably exceed the accuracy spin charts when "eyeballed" by most people when it comes to predicting blind listening preferences.

valerianf · Aug 16, 2020

Amir contribution is important because he brings data and known process during the tests.
Science is guiding my choices, not any blind test.
Thank you Amir.

Thomas savage · Aug 16, 2020

amirm said:
There is this constant attempt to exaggerate the current situation and then complain about it. No one is drawing any conclusions about anything from my subjective listening tests. They are an independent data point with their positives and negatives.

On the positive front, I check for such things as how loud a speaker plays before bottoming out. Nothing in the preference score tells you this. And you have no case to make that i must listen blind to determine when a woofer is crackling or clearly bottoming out.

I can also make judgement about the design of a speaker. For example, lack of directivity control during the crossover region. Again you need no blind test to determine this.

I take these and other issues into consideration as I make up my mind to give a Yes/No recommendation to a speaker. That's all. No science is being made. No research is disputed. I am simply sharing my opinion as the only person who has the speaker at hand, has tons of others to compare it to, and experience of having done that nearly 80 times in six months.

Don't expand the borders of this and then complain that it is too broad. I have not stated a single conclusion across all of these listening tests and measurements.

It is preposterous to take the position that no one can ever, without a formal study, tell you a speaker that they just measured and listened to, cannot be followed with "sure, I would by it," or "no, I don't think I will."

Maybe you all live in some fantasy land where only forward progress is made with controlled studies and else, you do nothing. I am not that type. I am engineer first and foremost and make progress with tools and data I have in hand. I have to deliver value to membership and one of those key values is whether I personally think, with all the data I have, the product is worthwhile or not.

Please keep mind that there is not one argument or citation you put forward that I don't know and have not read or practiced. The fact that I am doing what I am doing means that I consider it proper and defensible. So constantly telling me about the same things doesn't move the dial forward at all.

Now, you could develop data I don't have. Take all a few of the speakers I have evaluated and perform a double blind controlled test and show that preference is different than what I stated. Be sure to use trained and critical listeners. Then we have something new. Until then, arguing is not data.

You need to engineer yourself some less easy to grab pant legs.

edechamps · Aug 16, 2020

preload said:
is the ability to "rule out" lousy speakers really the purpose of the spin?

Well… yes?

We should keep in mind that back when the spinorama concept was pioneered by @Floyd Toole, there were a lot of really bad speakers on the market, with large variations in spinoramas. Examples given in Toole's book are not subtle, to the point where no-one would have any problem predicting the most preferred speaker just by "eyeballing" the graphs. I think we can all agree that spinoramas are effective in separating really bad speakers from good ones.

However, here on ASR we don't resist the temptation to go further and routinely go through spinoramas with a fine-toothed comb to try to discern small differences between two well-behaved speakers. The debate on the SVS Ultra Bookshelf is one example. I am not sure this approach makes sense. When faced with similarly good spins, it's hard to determine if one speaker will really be preferred over another. If it is, then other factors (e.g. non-linear distortion) might better explain the perceived differences. (When @Sean Olive stated in its preference rating study that non-linear distortion is not a significant factor in loudspeaker preference, we should keep in mind that this was written in the context of evaluating a large sample of speakers with wildly varying spinoramas - not trying to nitpick small differences between two similarly good speakers.)

preload said:
Which is why I say that Olive's regression equation (based on mathematical analysis of a subset of spin charts) can predict blinded preferences, and its accuracy has been quantified across a representative sample of 70 loudspeakers (and likely way more to date). Whereas when it comes to eyeballing a complete set of spin charts - who knows.

The problem with @Sean Olive's model is that its predictive power tends to degrade quickly when comparing speakers that are less than 1.0 points apart. This makes it unlikely that it will help when comparing speakers that have similarly good measurements.

HooStat · Aug 16, 2020

edechamps said:
comparing speakers that are less than 1.0 points apart

I agree with you wholeheartedly. There is also no evidence that his model is the optimal model to evaluate "good" speakers. There are many other inputs he could have tried, and many other ways to quantify the inputs he did try. With ~ 70 data points, one's ability to fit any model is severely limited. The Olive model is based on his intuition and experience in what makes for good sound (including previous research), his creation of variables that capture that intuition, and his fitting a model to the data. If 10 other people did the process, there would be 10 other models, most of which would probably be reasonable, but different.

Plus, the model had very little data at the highest preference scores, so we don't actually know what discriminates among the best speakers based on this model. As someone said, "All models are wrong. Some are useful."

preload · Aug 17, 2020

edechamps said:
Well… yes?

We should keep in mind that back when the spinorama concept was pioneered by @Floyd Toole, there were a lot of really bad speakers on the market, with large variations in spinoramas. Examples given in Toole's book are not subtle, to the point where no-one would have any problem predicting the most preferred speaker just by "eyeballing" the graphs. I think we can all agree that spinoramas are effective in separating really bad speakers from good ones.

Okay. I can see how that capability might be useful to some people, and the historical context is interesting. Thanks @edechamps. Personally, a predictive tool that differentiates speakers in the upper tier would be most useful to me, in that it's not particularly interesting (to me) to predict that a $100 Bose cube isn't going to sound as good as, say, a Genelec.

The problem with @Sean Olive's model is that its predictive power tends to degrade quickly when comparing speakers that are less than 1.0 points apart. This makes it unlikely that it will help when comparing speakers that have similarly good measurements.

Doh.

Perhaps then sighted listening observations in conditions of low levels of bias may be superior to: a) Olive's regression and b) simple eyeballing of spinorama charts, in differentiating/comparing the sound quality of the highest tier loudspeakers.

bobbooo · Aug 17, 2020

edechamps said:
The problem with @Sean Olive's model is that its predictive power tends to degrade quickly when comparing speakers that are less than 1.0 points apart. This makes it unlikely that it will help when comparing speakers that have similarly good measurements.

This is to be expected from any model though - the closer the objective measurements of two speakers, the greater the variance in relative subjective ranking they would receive from a set of listeners.

HooStat · Aug 18, 2020

preload said:
it's not particularly interesting (to me) to predict that a $100 Bose cube isn't going to sound as good as, say, a Genelec

The point of the model is not that it can do this seemingly "easy" task. The point is to identify WHY/HOW it can do the task. In other words, is it the smoothness of the frequency response in the listening window? That is where the insights come from.

edechamps said:
predictive power tends to degrade quickly

It isn't actually that its predictive power degrades. The estimate is still the best estimate. It is that the UNCERTAINTY increases.

preload · Aug 18, 2020

HooStat said:
The point of the model is not that it can do this seemingly "easy" task. The point is to identify WHY/HOW it can do the task. In other words, is it the smoothness of the frequency response in the listening window? That is where the insights come from.

I'm not quite following what you're saying, or why you decided to "correct" my statement. Perhaps it's semantics.
Is the point of the Olive regression formula not to predict loudspeaker preferences using objective measurements? Because that's the title of the paper (for starters).

HooStat · Aug 18, 2020

preload said:
I'm not quite following what you're saying, or why you decided to "correct" my statement. Perhaps it's semantics.
Is the point of the Olive regression formula not to predict loudspeaker preferences using objective measurements? Because that's the title of the paper (for starters).

I was most definitely not correcting anything. I was trying to point out that the model is useful for many things beyond a binary prediction. You had mentioned that you were not interested in predicting a preference for Genelec over Bose, and I also don't really care about that either. But regression models are very useful creatures -- they don't just "predict". They also describe associations between inputs (speaker measurements) and outputs (preference scores). My point was that the true value of the regression equation is that it helps explain why Genelec gets a better score than Bose. And this suggests why one might prefer one of two speakers that have closer scores (say Genelec vs. Revel). The limitation is that the model doesn't have much information at higher scores, so it requires stronger assumptions to make that statement (i.e., to assume that the factors that describe preferences at lower preference levels work the same way at higher preference levels). These assumptions may not be valid.

The other quote about predictive power was to point out that the predicted preference difference is accurate and that there is no degradation of its ability to "pick a winner" (subject to the assumptions in constructing and interpreting the model). However, the uncertainty of the predicted winner is much higher when the predicted preference scores are close together. And this makes sense -- the closer two things are, the more likely for being randomly wrong when trying to make a choice between them. The other thing the model says is that even if you pick "wrong" you are not likely to be very wrong. At some point small differences are no longer meaningful -- since scores are provided on an integer level, I would suspect that differences of less than 0.5 points start to become unimportant.

The frailty of Sighted Listening Tests

Active Member

Addicted to Fun and Learning

Addicted to Fun and Learning

Active Member

Major Contributor

Grand Contributor

Founder/Admin

Active Member

Major Contributor

Member

Major Contributor

Addicted to Fun and Learning

Grand Contributor

Addicted to Fun and Learning

Addicted to Fun and Learning

Major Contributor

Major Contributor

Addicted to Fun and Learning

Major Contributor

Addicted to Fun and Learning

Similar threads