I cannot trust the Harman speaker preference score

IPunchCholla · Mar 6, 2022

dfuller said:
Well, yes, the problem is that limitation leads to some rather bizarre results that don't always match with reality.

In what way do they not match reality? Do we know people prefer B&Ws more than AirPods? I would think they would measure better, but I have no idea if more people would prefer the sound.

afranta · Mar 6, 2022

Xulonn said:
Those who understand science would automatically be aware of the limitations of those findings based on the methodologies of the test.

The problem is that on this and other sites where science-literate people gather, there are also many others who join in do not understand the limitations of science and statistics-based testing, and extrapolate limited conclusions to be conclusive. And then the endless arguments begin...

I understand enough of the science to distinguish between preference scores and metrics that might usefully predict my preferences. Which is what I wrote.

Asinus · Mar 6, 2022

The formula is a regression of the dataset available at the time the original paper was published, it was not meant to be a silver bullet for deciding which speaker is better for all eternity but to show that objective measurements (the spinorama in particular) had enough information to predict consistently and reliably subject preferences in double-blind tests.

I think we must be careful comparing an Olive score calculated from HATS in an anechoic chamber spin vs a Klippel NFS vs manual gated measurements if the scores are close enough due to differences in smoothing, resolution and calibration. Scores calculated from Amir's measurements and Harman published data for the same speaker are not the same, while the spin look pretty much the same but smoothed in Harman's spin.

amirm · Mar 6, 2022

Asinus said:
Scores calculated from Amir's measurements and Harman published data for the same speaker are not the same, while the spin look pretty much the same but smoothed in Harman's spin.

This is clearly one of our problems. I am confident a different set of coefficients would have been computed if our data was used instead of original Harman measurements. We also have far richer set of speakers measured now which may have again led to changes to the formula. Alas, we don't have controlled listening test data to compute the parameters ourselves.

jae · Mar 6, 2022

While there may be good sounding speakers that don't score as expected, it is fair to say that the scoring system does not let any bad-sounding or ill-preferred speakers near the top. If you are honest with yourself and can acknowledge this, then you also must admit the methodology is more "useful" than flawed- which is far from useless.

What is more important, the magnitude of the score itself or understanding why something scored what it did with respect to the determinant parameters? I like using Pierre's site (https://pierreaubert.github.io/spinorama/scores.html), and it is very easy in my opinion to see why a seemingly good or bad sounding speaker did not get a certain score, and why all the top and low scoring speakers got the score they did. If subjectively, to you, one of the parameters isn't that important for whatever reason but you understand how it biases the score, you can still effectively use the scoring system along with the available data to judge speakers. A score alone can tell us quite a bit but it is only one figure which is the biggest limitation.

My time is finite and valuable, if I have to buy speakers or make a shortlist for audition, or if I have to make a recommendation to another person I am starting at the top of this list with no exception, then moving down considering the budget, listening setting/environment, and so on. The subtleties and nitpicking and waxing personal preference when it comes to one versus the other(s) can come afterwards. I have no interest to play the game of looking for proverbial "diamonds in the rough" in the middle of the chart that perform well in my special room at 11-o-clock on a Tuesday when all the best speakers with a high degree of veracity are already on a list for me.

pierre · Mar 6, 2022

amirm said:
This is clearly one of our problems. I am confident a different set of coefficients would have been computed if our data was used instead of original Harman measurements. We also have far richer set of speakers measured now which may have again led to changes to the formula. Alas, we don't have controlled listening test data to compute the parameters ourselves.

The score is useful in a variety of ways:
- takes into account flatness of on axis, listening window and predicted in room
- takes into account how much bass you have and some properties of the bass
- it is not an opinion

A lot of companies design for good score: KEF, Genelec, etc because it matches what we know about precise reproduction of sound.

The score is not useful in some ways:
- it is about preferences (statistically significant and people over estimate how far they are from the median)
- not precise: +/-0.5 is the same bucket in terms of statistics
- designed from bookshelves and tower speakers monopole and for far field listening: doesn’t mean much for in wall, omnis, surround …
- does not take into account SPL and distortion: rank large and small speakers the same if they have close spinorama
- it is unclear if we have preferences when the score is high and SPL good enough for the test. For example: if you put a genelec one and a kef in the same room, both are very good after EQ, sound different, but divergence in preference will decrease (my observation).

How I use the score:
- must be over 6
- SPL and distortion must match my requirement for a room

Then I have a good speaker.

pierre · Mar 6, 2022

amirm said:
This is clearly one of our problems. I am confident a different set of coefficients would have been computed if our data was used instead of original Harman measurements. We also have far richer set of speakers measured now which may have again led to changes to the formula. Alas, we don't have controlled listening test data to compute the parameters ourselves.

I have been wondering for a long time if the parameters would be very different or not. I think not. We would increase precision but the score would still be very correlated to a flat PIR, a flat LW and as much bass as possible. SPL, distortion and possibly time alignment or group delay would be great to have mixed into the score.

DanielT · Mar 6, 2022

Directivity and FR perhaps a matter of opinion, subjectively what one likes BUT distortion. Does anyone like speakers with audible distortion?

But I know someone who likes it:

I'll teach it "- and slashed the speaker cone. It changed the sound of my guitar

The Kinks: How Dave Davies’ Slashed Amp Created Rock Distortion

When it comes to breakout singles, they don’t get much better than “You Really Got Me.” The 1964 track didn’t just put the Kinks on the map; it changed the rock n’ roll landscape with its incendiary guitar tone. “You Really Got Me” brought distorted guitar to the masses. It’s the genesis of all...

www.thaliacapos.com

But that is a consciously created musical expression. That's another thing. Then it is instead a question of a yummy distortion.

Edit:
FR can be changed according to taste (at least on axes, to a certain extent). Okay from the beginning straight FR on axes facilitates to later EQ to taste.

But distortion can not be changed. It is what it is.

Directivity. Then you have to test yourself. For example, dipoles. Some like them, others do not like them at all. Same thing with electrostatic speakers. By the way, that is good.

The world would be a sad place if everyone liked exactly the same.

Frgirard · Mar 6, 2022

tuga said:
How many of us have gone to a dealer and asked to listen to a single speaker?
“Ah, this one sounds better, I’ll have two to take away, please…”
Or brought a single speaker home to test?
It’s absurd.

No. A habit to change. Especially since sellers lend.

amirm · Mar 6, 2022

pierre said:
A lot of companies design for good score: KEF, Genelec, etc because it matches what we know about precise reproduction of sound.

I have not heard of any of these companies or even Harman using the score for any speaker design. Where have you read this?

What companies are designing to is the fundamentals of flax on axis and smooth directivity. That is very sound and ideal. But it doesn't equate to computing the score and trying to up the number.

abdo123 · Mar 6, 2022

It’s not like we have 23 mathematical models to calculate speaker preference.

We work with what we have imo. And with the ever decreasing interest in Audio reproduction i doubt that there will ever be research of that magnitude in the future.

If it’s nothing or the score i would always pick the score but at the same time we must understand its limitations. Dr. Olive himself said he would not trust a score difference below one point to be significant.

jonfitch · Mar 6, 2022

The preference score is just a score of the frequency response. Obviously anyone listening to the Sonos Roam will hear it's deficiencies compared to other speakers in terms of lack of dynamic punch from the DSP limiters, and the extremely loud hiss, which all detract from the actual perceived performance of the speakers.

tuga · Mar 6, 2022

sarumbear said:
Sorry what is absurd?

The formula, or more to the point, the methodology used to determing listening preference.

This is the study I was referring to:

Toole's reasoning, which I disagree with (for assessing stereo reproduction), is that mono is more descriminating because it produces larger differences.
In my view, the differences come mostly from differences in directivity (which affect perceived width and spaciousness) when listening to a single speaker (which is not the intended use in a stereo system), but also inadequate positioning of the speaker (dipoles work with the wall behind them).
He is also a proponent of multi-channel, which in a way conflicts with those of us use 2-channel only.

Xulonn said:
It appears that @tuga does not accept the reasoning for testing tonality and directional characteristics with a single loudspeaker rather than a stereo pair. That reasoning is complex and I understand it pretty well, but not well enough to be one who explains it.

It is a matter of fitness for purpose.
Single-speaker assessment is effective for determining frequency response characteristics and identifying issues/distortions, but one of the joys of stereo is the "three-dimensional" illusion, which you can't eveluate by listening to a single speaker.
On top of that, placing a single speaker in the middle of a room will not "show" how it (or a pair) will interact with the room when positioned as a pair (which depends on the speaker's directivity and also the room boundary absorption).

Other aspects have to do with having people assess speakers from a theatre-like set of seats, not always on axis or the correct axis for a particular speaker, and a complete disregard for optimal positioning of speaker and listener in terms of bass even though Toole defends that bass is what people rate highest.
The shuffler is a good attempt but a flawed one in my view. If blind testing does not use adequate methodology it will still produce the wrong results. Unbiased yes, but wong.

I have discussed this a lot in the forum; an example here:

https://www.audiosciencereview.com/...ts-of-room-reflections.13/page-13#post-447274

Sancus said:
I started doing this in shops because of this site and let me tell you, evaluating a single speaker is far far more useful than evaluating in stereo.

Stereo evaluations hide speaker flaws, that's all they accomplish.

E: sry for double post, stupid phone.

See reply above.

TimVG · Mar 6, 2022

Just as a single curve cannot encompass loudspeaker performance, neither can a single number.
While it may seperate the obviously flawed from the rest, the score is meaningless without accompanying data - and ironically, once you have that data, the score itself becomes redundant. In other words, I have no use for it.

tuga · Mar 6, 2022

TimVG said:
Just as a single curve cannot encompass loudspeaker performance, neither can a single number.
While it may seperate the obviously flawed from the rest, the score is meaningless without accompanying data - and ironically, once you have that data, the score itself becomes redundant. In other words, I have no use for it.

Agreed.

I would go a step beyond and say that a Spinorama is insufficient to characterise (audible) loudspeaker performance.

Frank Dernie · Mar 6, 2022

sarumbear said:
The more I look into it the less I can trust the Harman speaker quality score. IMHO it is totally meaningless metric. I know the background, I read all the papers even before Harman was involved. However, it works so badly that IMHO it is a useless metric.

Here are some scores that I took from this database to prove why.

The KEF Reference 2C Meta: 5.6
Sonos Roam: 5.5
JBL M2: 5.1

The 17cm x 6cm smart-speaker Sonos Roam scores just shy of a real reference of a speaker from KEF, while a JBL flagship that weighs 60kg scores less than that smart-speaker.

Do I have a case or not?

I agree.
Maybe useful in reducing the size of a short list, but not much more IME.

abdo123 · Mar 6, 2022

sarumbear said:
The more I look into it the less I can trust the Harman speaker quality score. IMHO it is totally meaningless metric. I know the background, I read all the papers even before Harman was involved. However, it works so badly that IMHO it is a useless metric.

Here are some scores that I took from this database to prove why.

The KEF Reference 2C Meta: 5.6
Sonos Roam: 5.5
JBL M2: 5.1

The 17cm x 6cm smart-speaker Sonos Roam scores just shy of a real reference of a speaker from KEF, while a JBL flagship that weighs 60kg scores less than that smart-speaker.

Do I have a case or not?

Not sure if that has been brought up yet or not but picking out these speakers is a bit extreme out of the entire database.

The 2C Meta is vendor measurement, so the score is inflated because the measurements are of lower resolution compared to the near field scanner.

The Sonos Roam honestly measures quite well, but the score is also inflated because it is a time gated measurement (lower resolution).

the JBL M2 measurements are from a nearfield scanner, but honestly the speaker just doesn't measure that good for a digital active design with tens of filters applied. Just shows how good the nearfield scanner is compared to other measurements methods, in particular, the 'Harman Audio Test System'.

Spin%2B-%2BJBL%2BM2%2B%2528missing%2Bon-axis%2Bdata%2529.png

CEA2034%20--%20JBL%20M2%20%28Crown%20iTech%205000%20Amp%3B%20M2%20Base%20Configuration%29.png

TimVG · Mar 6, 2022

abdo123 said:
the JBL M2 measurements are from a nearfield scanner, but honestly the speaker just doesn't measure that good for an digital active design with tens of filters applied. Just shows how good the nearfield scanner is compared to other measurements methods, in particular, the 'Harman Audio Test System'.

Well for starters in traditional measurement methods, the oblique angles are missing entirely. Then there's the matter of a reference sample vs production samples.

tuga said:
I would go a step beyond and say that a Spinorama is insufficient to characterise (audible) loudspeaker performance.

In a sense I agree. It's a great tool, better than any score or single curve, to weed out the bad from the good, but there is more to it then that. Otherwise their own blind test champion would also have the best looking spinorama, which it doesn't.

abdo123 · Mar 6, 2022

TimVG said:
Well for starters in traditional measurement methods, the oblique angles are missing entirely. Then there's the matter of a reference sample vs production samples.

Can you explain what you mean by the oblique angles are missing?

Also I wouldn't argue sample to sample variation when there are several resonances in the 200Hz to 500Hz region which from my prespective seem strategically smoothed out.

sarumbear · Mar 6, 2022

Sancus said:
Your first post states that it's useless, which does imply wanting to get rid of it. If that's not what you meant, then you should use a different word, because yes, you did say you want to get rid of it by calling it useless. Useless literally means it has no use. I don't get why you always try to hide behind semantics in so many of your posts. It's very weird and counterproductive.

I said "IMHO it is totally meaningless metric." You showed your ignorance by failing to read and understand that what I said then blamed me by hiding behind semantics.

But you are rude as well. What right do you have to tell a fellow member what words to use and what not to use?

Sancus said:
Anyway, whatever. These threads never go anywhere useful.

Please take your clairvoyance abilities elsewhere and do not pollute this thread if you have nothing positive to add to it.

I cannot trust the Harman speaker preference score

Do you value the Harman quality score?

100% yes

It is a good metric that helps, but that's all

No, I don't

I don't have a decision

Major Contributor

Member

Member

Founder/Admin

Major Contributor

Addicted to Fun and Learning

Addicted to Fun and Learning

Major Contributor

Major Contributor

Founder/Admin

Master Contributor

Senior Member

Major Contributor

Major Contributor

Major Contributor

Master Contributor

Master Contributor

Major Contributor

Master Contributor

Master Contributor

Similar threads