
Speakers measurements anatomy

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,559
Likes
1,703
Location
California
1665760152035.png

This is what Harman's r=0.86 (and r-squared of 0.74) looks like visually. Look at the variation in measured preferences vs predicted for a predicted "5" and a predicted "6." Meaning, the computerized analysis clearly showed a 1-step difference (which is a huge jump considering the predictions mostly cluster between 3 and 7), but listener preferences varied all across the board without clear differentiation.
 

fluid

Addicted to Fun and Learning
Joined
Apr 19, 2021
Messages
694
Likes
1,198
Put very simply, r=0.86 is good enough to support the argument that the Harman model of measurement interpretation correlates well with listener preferences compared to prior models (like the Consumer Reports model). BUT, when it comes to accurately predicting those listener preferences, you have to go further and recognize that the r-squared is only 74%. And if you look at the actual vs predicted score chart visually, it's very clear how much uncertainty that actually describes.
I don't really disagree with you other than you seem to see the glass as three quarters empty rather than three quarters full. To me it shows that three quarters of what determines preference in most cases is frequency response either on axis or the balance of the on axis to off axis and the smoothness of the response. This is pretty much Floyd Toole's mantra of flat on axis with smooth well controlled directivity and a lack of resonances.

To me the difference in confidence level between the 13 speakers and 70 speakers shows that when the speakers are all very similar in size and construction, really the only thing that separates them is the frequency response in one form or another. When the speaker types and construction get more varied there is more to think about, and the ~25% of unexplained variance means that more data than frequency response / directivity alone is needed to determine preference. The model also does not work equally well for all speaker constructions, but within the cone/dome paradigm of hifi speakers it is pretty good. So I do agree with what I think you are saying: if anyone thinks a spin graph by itself is enough to have all the answers, it isn't.

I don't take the score for anything other than sorting the wheat from the chaff. The research that went into generating it gives some really good insights into a lot that matters in speaker preference, but certainly not all. The algorithm, or a tweaked version of it, is quite helpful when evaluating options during the crossover design of a speaker; if the measurements were accurate and the score is 7-plus, the chances of it sounding good are very high. It won't take you from decent to awesome but it's a really good start.
This is what Harman's r=0.86 (and r-squared of 0.74) looks like visually. Look at the variation in measured preferences vs predicted for a predicted "5" and a predicted "6." Meaning, the computerized analysis clearly showed a 1-step difference (which is a huge jump considering the predictions mostly cluster between 3 and 7), but listener preferences varied all across the board without clear differentiation.
Sean Olive points out in the paper that one area needing improvement was the subjective evaluation. The graph shows why, but it also shows a growing trend toward consistency of opinion when the score gets quite high. This makes sense to me because I have found it quite easy to tell when something sounds right. It just all clicks into place and you know it. Before then things can still sound good, but it's hard to put your finger on just what is wrong, so subjective ratings being all over the map makes sense.
 

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,559
Likes
1,703
Location
California
I don't really disagree with you other than you seem to see the glass as three quarters empty rather than three quarters full.
Well, please don't get me wrong - this series of papers from Olive's group is still novel and groundbreaking and added a ton of understanding about loudspeaker measurements, what aspects of those measurements matter more, and how predictive measurements can be.

What bothers me is that there are some people who, without fully understanding the research, believe that loudspeaker listener preferences can be entirely predicted by measurements, or worse, that eyeballing a series of measurements can even come close. I get that people really really want to believe this to be true (because it's true for solid state devices like DACs and amplifiers), but wishing something to be true does not make it true.

To me it shows that three quarters of what determines preference in most cases is frequency response either on axis or the balance of the on axis to off axis and the smoothness of the response. This is pretty much Floyd Toole's mantra of flat on axis with smooth well controlled directivity and a lack of resonances.
Smoothness and directivity (on/off-axis response) account for less than 3/4, since 30% of listener preferences is associated with bass extension. While Harman's design philosophy involves smooth directivity, and Revel/JBL make really good loudspeakers based on these design goals, it doesn't mean that you can't have a highly preferred speaker that doesn't follow this philosophy. And you can clearly see this by looking at the chart I posted above from Olive's paper, by looking horizontally and seeing how many loudspeakers with the same subjective score had widely differing preference scores predicted by measurement.

To me the difference in confidence level between the 13 speakers and 70 speakers shows that when the speakers are all very similar in size and construction, really the only thing that separates them is the frequency response in one form or another.
Perhaps. But it also gives us an idea of how the final regression would perform when extrapolating further outside the sample. r went from 0.99 to 0.70 when the original model based on the 13 loudspeakers was applied to the larger sample of 70. An r of 0.70 is an r-squared of 49%, which means that the extrapolated formula could only explain HALF the variability in listener preferences. So we talk about 74% for the 70 speakers...how much lower is it when we look at speakers that weren't part of that sample of 70?
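Spelling out the arithmetic for anyone following along (the coefficient of determination, r-squared, is literally the correlation coefficient squared):

```latex
r^2 = r \cdot r
\quad\Rightarrow\quad
0.86^2 \approx 0.74\ (74\%), \qquad
0.70^2 = 0.49\ (49\%), \qquad
0.99^2 \approx 0.98\ (98\%)
```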
 

fluid

Addicted to Fun and Learning
Joined
Apr 19, 2021
Messages
694
Likes
1,198
I get that people really really want to believe this to be true (because it's true for solid state devices like DACs and amplifiers), but wishing something to be true does not make it true.
I don't wish to be a contrarian, but my own personal experience flies in the face of the idea that all amps and DACs sound the same once you reach a certain level of performance. I really wanted to believe that because it would be much easier; unfortunately, I can hear a difference. I do not have a level-matched ABX printout as proof, so further discussion beyond that statement won't be of value to anyone.
Smoothness and directivity (on/off-axis response) account for less than 3/4, since 30% of listener preferences is associated with bass extension.
To me that is still part of the overall frequency response / directivity and not a separate entity. That is one of the limitations of the model: it only divides parameters of the frequency response up into different quantities.
While Harman's design philosophy involves smooth directivity, and Revel/JBL make really good loudspeakers based on these design goals, it doesn't mean that you can't have a highly preferred speaker that doesn't follow this philosophy. And you can clearly see this by looking at the chart I posted above from Olive's paper, by looking horizontally and seeing how many loudspeakers with the same subjective score had widely differing preference scores predicted by measurement.
That is true: some things that don't really matter are penalized by the existing algorithm, and others that do matter are not given the weight they deserve. This is always likely to occur when trying to fit data. With more and better data the model could be improved to be more correct, more of the time. Whether one person scores a speaker a 3 or a 7 doesn't tell you much by itself; as I said before, Sean Olive pointed out the subjective variability as an issue. From other research from Olive and Toole it does seem that overall the ranking of speakers from worst to best is quite consistent, even though the actual ratings in terms of numbers might be all over the map.
Perhaps. But it also gives us an idea of how the final regression would perform when extrapolating further outside the sample. r went from 0.99 to 0.70 when the original model based on the 13 loudspeakers was applied to the larger sample of 70. An r of 0.70 is an r-squared of 49%, which means that the extrapolated formula could only explain HALF the variability in listener preferences. So we talk about 74% for the 70 speakers...how much lower is it when we look at speakers that weren't part of that sample of 70?
There is so much about a speaker that is not considered or given a value in the existing preference research from Olive. It is a valiant effort to try and quantify the differences and for the vast majority of very similar cone and dome hifi speakers the result is really quite accurate. The factors that feature heavily in the model are all important features in preferred loudspeakers and if nothing else that is really useful information to know.
 

thewas

Master Contributor
Forum Donor
Joined
Jan 15, 2020
Messages
6,901
Likes
16,906
There you go again! This one appears to be intentional.

So are you acknowledging that a person eyeballing a series of spin charts cannot possibly exceed the accuracy of Harman's computerized analysis in being able to predict loudspeaker listener preferences?
Neither I nor others claimed such, but you did: "What people are claiming here is that they can eyeball a series of spin charts and do better than the computer." So I asked you a simple question: where exactly did they do so?
 

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,559
Likes
1,703
Location
California
Neither I nor others claimed such, but you did: "What people are claiming here is that they can eyeball a series of spin charts and do better than the computer." So I asked you a simple question: where exactly did they do so?
I see what you're trying to do, and again, I'm going to assume that you're not just trying to intentionally miss the point. Taken "literally," nobody here specifically stated "bee beep, I'm better than a computer, bee bop"...and honestly, I feel that repeatedly asking if someone said that is a pretty rigid and hyper-literal interpretation of what I wrote.

Let me try to restate:
1) Harman's own published research demonstrates that computerized analysis of spin measurements is able to account for 74% of the variation in listener preferences in their 70-speaker sample.
2) Outside that 70-speaker sample, the variation in listener preferences explained by computerized analysis is going to be a LOT lower. Let's be generous and say 50%.
3) So when someone comes along and makes a bold statement (or judgement) about a speaker's overall perceived sound quality, such a statement is far more definitive than what Harman's data demonstrate is possible through computerized analysis.
4) And when such a bold statement is made based on "eyeballing" a spin chart, the individual is essentially making the claim that they can more accurately predict subjective loudspeaker quality than Harman's computerized analysis.
 
Last edited:

GaryH

Major Contributor
Joined
May 12, 2021
Messages
1,351
Likes
1,859
View attachment 236981
This is what Harman's r=0.86 (and r-squared of 0.74) looks like visually.
No it doesn't. That's the graph for the ancillary model (with a significantly lower r of 0.79 than the proper model) restricted only to sound power parameters, to counter Consumer Reports' erroneous premise that sound power should be flat.
 
Last edited:

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,559
Likes
1,703
Location
California
No it doesn't. That's the graph for the ancillary model (with a significantly lower r of 0.79 than the proper model) restricted only to sound power parameters, to counter Consumer Reports' erroneous premise that sound power should be flat.
You're right, I pasted the wrong figure in. I should have double-checked.
Here is the correct chart. Figure 5. r=0.86. Note that it doesn't really change anything. Look at the clustering around a predicted pref score of 5 vs 6.
1665891642479.png
 

GaryH

Major Contributor
Joined
May 12, 2021
Messages
1,351
Likes
1,859
@GaryH You sure about that? You might want to double-check before correcting someone. This is Figure 5, which depicts the general anechoic model (not the sound power model). r=0.86.
You sure about that? You might want to double-check your figures before (attempting) correcting someone. That's clearly Figure 13(b) in your post above, which is the generalized sound power model.
 

fluid

Addicted to Fun and Learning
Joined
Apr 19, 2021
Messages
694
Likes
1,198
Let me try to restate:
1) Harman's own published research is only able to show that loudspeaker measurements explain 74% of the variation in listener preferences under their experimental conditions.
2) When generalizing this capability to the typical conversation in ASR (i.e. outside the 70-speaker sample, different types of music, etc.), as explained already, that computerized spin analysis is going to be able to explain FAR LESS than 74% of the variation in listener preferences. It could potentially drop to 50% or even lower than that.
3) So, not even Harman's computerized spinorama analysis will be able to explain more than, say, half of the variation in listener preferences.
4) Which means that every time someone eyeballs a series of spin charts and then makes a definitive claim/statement about its perceived sound quality, that person is essentially making the claim that he/she can do better than Harman's computerized analysis.
There has been much more research by Toole and Olive than the specific studies Olive did trying to produce a metric. It started a long time before they went to Harman, and the results have always been consistent; don't forget this when assigning numerical values and extrapolating out from the r values of one study. The figures you quote suggest it might be no better than 50/50, and this is just not true.

There have been many more than 70 speakers tested in blind studies at the NRC and Harman, and the outcomes are always reported to be much the same. Flat direct sound, with smooth directivity and no obvious resonances, is what all the winners have in common. Add extended bass and all the obvious traits are there. These are quite easy to eyeball in a CTA-2034 plot, which I think is one of the benefits of it as a standard. The subtle nuances between one speaker and another take more effort to get to. Resonances stand out, as do shelved bass, bass extension, too much treble, overall lumpiness, and how smooth the listening window, early reflections, DI and power response are.

This won't tell anyone if they will like one speaker more than another if their measurements are quite similar. There is no need for a computer to analyse the measurements to see whether they fit the paradigm or if the speaker has obvious problems.
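That said, for anyone who does want to put a rough number on "overall lumpiness", here is a minimal sketch of the general idea. It is not Olive's exact NBD/SM definition, just a deviation-from-trend measure applied to a hypothetical listening-window curve:

```python
import numpy as np

def smoothness_deviation(freq_hz, spl_db, f_lo=300.0, f_hi=10000.0):
    """Standard deviation (dB) of a response from its own trend line on a log-frequency axis.

    A rough smoothness indicator in the spirit of (but not identical to)
    the deviation metrics used in Olive's model.
    """
    freq_hz = np.asarray(freq_hz, dtype=float)
    spl_db = np.asarray(spl_db, dtype=float)
    band = (freq_hz >= f_lo) & (freq_hz <= f_hi)        # restrict to the band of interest
    x = np.log10(freq_hz[band])
    y = spl_db[band]
    slope, intercept = np.polyfit(x, y, 1)              # linear trend on log(f)
    return float(np.std(y - (slope * x + intercept)))   # spread of the residuals, in dB
```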

As an example of eyeballing it, I just made a crossover from someone else's measurements. Once the crossover point and slopes were picked, everything was done by eyeballing the slopes to a target and then optimizing the combination by eyeballing the CTA-2034 curves and overall horizontal directivity. I didn't use the score to optimize it in any way, but with a sub it scores 8.44 according to Olive's equation. I have done this enough times to be able to do it by eye; I'm sure I am not the only one.

I don't know if I or anyone else would like it more than any other speaker (once the woofer is added) but if the measurements were accurate I can't see it sounding bad.
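For reference, "Olive's equation" here is the multiple regression model from Part 2 of the papers. A minimal sketch, assuming the four input metrics have already been computed from the spin data (deriving NBD, LFX and SM from the raw curves is the hard part and is not shown); the coefficients below are the ones usually quoted from the paper:

```python
def olive_preference_rating(nbd_on, nbd_pir, lfx, sm_pir):
    """Predicted preference rating per Olive's Part 2 regression model (as usually quoted).

    nbd_on  - narrow-band deviation of the on-axis response (dB)
    nbd_pir - narrow-band deviation of the predicted in-room response (dB)
    lfx     - log10 of the low-frequency extension frequency (Hz)
    sm_pir  - smoothness (regression fit) of the predicted in-room response, 0..1
    """
    return 12.69 - 2.49 * nbd_on - 2.99 * nbd_pir - 4.31 * lfx + 2.32 * sm_pir

# Hypothetical metric values, purely for illustration:
print(round(olive_preference_rating(0.35, 0.25, 1.20, 0.85), 2))  # -> 7.87
```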

Example Eyeball.png
 

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,559
Likes
1,703
Location
California
There has been much more research by Toole and Olive than the specific studies Olive did trying to produce a metric. It started a long time before they went to Harman, and the results have always been consistent; don't forget this when assigning numerical values and extrapolating out from the r values of one study. The figures you quote suggest it might be no better than 50/50, and this is just not true.
I can tell that you do not understand r and r-squared. "Accounting for 50% of the variation" has absolutely nothing to do with "50/50 guessing." They don't even belong in the same sentence.

The fact that Olive needed to devote an entire introductory section of his paper to explaining what regression analysis is and how it is evaluated statistically says a lot, I think, about the lack of understanding of this basic concept among the paper's target audience (engineers). My sense is that this contributes to a lack of understanding of what the papers do and do not show.
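To make the distinction concrete, here is a minimal sketch with made-up numbers (not Olive's data) of what the paper's regression reporting boils down to: fit measured preference against predicted preference, then report r and r-squared.

```python
import numpy as np
from scipy import stats

# Hypothetical predicted vs. measured preference ratings (NOT Olive's data)
predicted = np.array([3.1, 4.0, 4.5, 5.0, 5.2, 5.8, 6.1, 6.5, 7.0])
measured  = np.array([2.8, 4.6, 3.9, 5.5, 4.4, 6.3, 5.1, 6.9, 7.2])

r, p_value = stats.pearsonr(predicted, measured)
print(f"r = {r:.2f}, r-squared = {r**2:.2f}")
# r-squared is the share of variance explained by the fit - it is not a "hit rate"
```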

There have been many more than 70 speakers tested in blind studies at the NRC and Harman, and the outcomes are always reported to be much the same. Flat direct sound, with smooth directivity and no obvious resonances, is what all the winners have in common.

See, now you're just regurgitating the groupthink on ASR.

Didn't follow what you posted after that in terms of the point you were trying to make.
 

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,559
Likes
1,703
Location
California
You sure about that? You might want to double-check your figures before (attempting) correcting someone. That's clearly Figure 13(b) in your post above, which is the generalized sound power model.
Thank you, and I acknowledged my cut/paste error above.

You've clearly read and understand the paper.

So, does the correct figure 5 change anything? The visualization of r=0.86 (r-squared = 0.74) still demonstrates a significant degree of uncertainty in the measured preference rating for a given predicted preference rating over the vast majority of the range. Here it is again. I highlighted the predicted preference scores of 5 and 6, as before.
1665897739211.png
 

fluid

Addicted to Fun and Learning
Joined
Apr 19, 2021
Messages
694
Likes
1,198
I can tell that you do not understand r and r-squared. "Accounting for 50% of the variation" has absolutely nothing to do with "50/50 guessing." They don't even belong in the same sentence.
Condescension aside, you are correct: I am not a statistician, and I am not trying to argue statistics. My main point is that a lot of what makes a speaker preferred in general is known and was known before Olive tried to quantify it. That is what is easy to see just by looking at a CTA 2034 graph, assuming the person looking knows what they are looking at which is not a given.
Thanks for sharing your perspective.
Maybe you find your attitude and responses towards other members perfectly reasonable; I'm afraid I don't, and I have no interest in engaging in that sort of exchange.
 

nerdstrike

Active Member
Joined
Mar 1, 2021
Messages
263
Likes
317
Location
Cambs, UK
While I appreciate that a 3-point preference spread is large, I think that's a pretty good result for 70 human perceptions. Multiple linear regression isn't magic, and it's very easy to misuse. So much bad stats work has been published under the auspices of science, and I sure hope these graphs have been well made - it doesn't look supernaturally good, which is a good sign for the veracity of the modelling.

We may not be able to predict speaker preference well, but that doesn't mean we cannot successfully categorise speakers with the same data, even by eye. I place a lot more value on defect/design deficiency detection, which I suspect can be achieved much more reliably.
 

thewas

Master Contributor
Forum Donor
Joined
Jan 15, 2020
Messages
6,901
Likes
16,906
I see what you're trying to do, and again, I'm going to assume that you're not just trying to intentionally miss the point. Taken "literally," nobody here specifically stated "bee beep, I'm better than a computer, bee bop"...and honestly, I feel that repeatedly asking if someone said that is a pretty rigid and hyper-literal interpretation of what I wrote.
Which point? You stated something which doesn't really happen, and instead of admitting it was wrong you accuse others of interpreting it wrongly.

3) So when someone comes along and makes a bold statement (or judgement) about a speaker's overall perceived sound quality, such a statement is far more definitive than what Harman's data demonstrate is possible through computerized analysis.
4) And when such a bold statement is made based on "eyeballing" a spin chart, the individual is essentially making the claim that they can more accurately predict subjective loudspeaker quality than Harman's computerized analysis.
You are mixing up different things and making assumptions to arrive at your wrong conclusion.

Experienced people here use the extended sets of measurements to analyse and discuss flaws of loudspeakers and how those will probably sound - see, for example, a wide dip or peak in the presence region. They might also say that in a group of listeners the majority will prefer the one with significantly fewer flaws, but they won't claim that individual XY will prefer it. They have even calculated the statistical preference depending on the score difference, which also leads them to the conclusion that when the score difference is smaller than 1, the difference in statistical preference gets quite small.
 

tuga

Major Contributor
Joined
Feb 5, 2020
Messages
3,984
Likes
4,285
Location
Oxford, England
The preference tests are unfit for purpose and that invalidates the data. Regardless of the model.
 

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,559
Likes
1,703
Location
California
Condescension aside, you are correct: I am not a statistician, and I am not trying to argue statistics.
You don't need to be a statistician to understand regressions, r, and r-squared.
But you DO need to understand these concepts to understand the Olive paper at any basic level!

Which means you and I can't have an intelligent discussion about the Harman research and what it tells us.

My main point is that a lot of what makes a speaker preferred in general is known and was known before Olive tried to quantify it. That is what is easy to see just by looking at a CTA 2034 graph, assuming the person looking knows what they are looking at which is not a given.
Right - the measurement characteristics that seemed to correlate with listener preference were known before Olive, absolutely. But what was NOT known was to what degree individual measurement characteristics contributed to the final listener preference, and with what degree of certainty. THAT is what the Olive paper added to the general knowledge.

Maybe you find your attitude and responses towards other members perfectly reasonable; I'm afraid I don't, and I have no interest in engaging in that sort of exchange.
Yes, and I was attempting to express my disinterest in further conversation with my short "thank you" response.
 

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,559
Likes
1,703
Location
California
Which point? You stated something which doesn't really happen, and instead of admitting it was wrong you accuse others of interpreting it wrongly.
If I were to write that you are "driving me up the wall," would you then demand to know what type of vehicle and point out that this could not have occurred unless the wall was not perfectly vertical? Serious question.
Experienced people here use the extended sets of measurements to analyse and discuss flaws of loudspeakers and how those will probably sound - see, for example, a wide dip or peak in the presence region. They might also say that in a group of listeners the majority will prefer the one with significantly fewer flaws, but they won't claim that individual XY will prefer it.
Perhaps you and I are reading completely different threads, because I almost never see an "eyeball interpretation" of a spin accompanied by humble qualifiers like "probably" or even "majority might prefer."
They have even calculated the statistical preference depending on the score difference, which also leads them to the conclusion that when the score difference is smaller than 1, the difference in statistical preference gets quite small.
I'm not quite following - I don't believe members here are attempting to reproduce Olive's work at the same scale (i.e. "measured" preference scores), although I have seen a few fun living room experiments (that were qualified as such!) posted here. So I'm not sure what you're saying. Which "score" are you referring to, exactly? Are you referring to the predicted preference score based on computerized analysis?
 
Last edited: