Four Speaker Blind Listening Test Results (Kef, JBL, Revel, OSD)

PeteL · Aug 17, 2021

Yorkshire Mouth said:
Okay, this is a great experiment, but I feel there’s something more to be done.

It appears that speakers which went lower did better. No surprise there. But what would be interesting would be to throw in a relatively poor speaker, which went lower.

In other words, how much if what we like about a speaker is how low it goes, and how much is made up of how good it is otherwise.

Once again, great experiment, many thanks.

It's an important criteria for sure. But the thing is, it's an objective criteria, not a bias. It does have weight in the harman preference score, If this experience tells us that the bass extension plays a role in the preference, more thorough studies tells us that too. For the record, ASR reviewed the Kef Q100 as an unrecommended product, even tough it has a good "preference" score both in this study and by numbers. I didn't hear them but this would count as a "relatively poor" speaker, at least by Amir's listening tests and measurments. So maybe it tells us that if you have no sub we should give importance to this metric, it objectively enhance the experience?

GaryH · Aug 17, 2021

MatthewS said:
What I will say, is that my rating is likely biased, but not because I own the speakers or have listened to them. Trust me, once you blind it and you randomize it, good luck remembering exactly how they sound, especially when you normally listen in stereo.

It is because I knew what the spin data for the speakers looked like. As I was listening, if something stood out as off or different, I was immediately thinking, those highs sound terrible, that's the OSD. But that's the extreme case with a really terrible speaker. Between the Kef and the Revel it wasn't at all obvious. I even liked the JBL best on one track.

Yes, seeing measurements before listening can significantly bias judgements, no matter how much this is denied on here. All it takes is a shift of focus, for example as you describe to the high frequencies if you've seen an error in the frequency response there.

ROOSKIE · Aug 17, 2021

Regarding preferences.
Dynamics and scale are very important to me.
I am not a bass head per say, that said bass depth and weight contribute a lot to my/many folks perception of dynamics, scale, excitement, power, room filling soundstage, realism - things I value very much.
I also do a lot of electronic music and music with bass guitars and drums.
Especially on familiar tracks, it is difficult to enjoy them as much when known moments of exciting impact are removed or drastically reduced.
( exaggerated or bloated bass is no good for me either)

The other top speaker value for me is long term listening. This is not possible to test quickly. I couldn't care less how a speaker sounds for a few tracks, I MUST be able to enjoy the sound at reasonably loud volumes for as long as I wish. There are speakers that fall off with this requirement that sound quite good initially. (Many speakers fall off here such as KEF's Q150 - something oddly tiring in the highs for me)
It also has to still sound exciting AND have that long term listening covered. (ELAC's DBR62 fell out here for me, so boring as have been several speakers I have tried)

Finally port noise kills the deal. I can not stand any port chuffing even if rare I will hear it. Any consistent port noise sends a speaker out the door.

CtheArgie · Aug 17, 2021

Vince2 said:
Actually it is brave to publish the results with a small sample size. Hard to get significant results with small samples. The fact that there are 2 significant effects means these are very noticeable. To find subtle significant differences you want a large sample.

Actually, in medicine there are lots of examples when studies with small samples have reported results that have not been confirmed in larger ones.

For example, the remission rate in Crohn's disease with infliximab in the study by Targan was done with 28 patients per arm. The larger studies coming after were never ever able to even get close to the remission rate of this one.

To this day, most physicians believe and quote the Targan study despite this fact. Infliximab was even originally approved by the FDA based on this study

uwotm8 · Aug 17, 2021

Bear123 said:
certainly reveals your view on the research

It's my view on zealots, not the research. There was a lot of researches before and there will be more in future. But for some reasons Harman and some their probucts became craizly hyped. And then you say hmm OK lets try. And then listen to its sound. One word: dissapointment. You can call it trolling or whatever but I think that result is everything. Not the process.

Bear123 said:
It's clear that you hold the research in disdain because it doesn't match with your opinion

My opinion is that it's overrated and hyped, again, corresponding to what we got as a result (products).

Bear123 said:
but stereo made things much more difficult and time consuming

It depends on what you're interested in: rough FR/tonality comparison or something more.
If FR is the only thing that matters - that's Harman, not me!

- then pair of JBL 308Ps is an endgame setup.
Or maybe it's a bit more difficult, not sure.

ROOSKIE · Aug 17, 2021

uwotm8 said:
It's my view on zealots, not the research. There was a lot of researches before and there will be more in future. But for some reasons Harman and some their probucts became craizly hyped. And then you say hmm OK lets try. And then listen to its sound. One word: dissapointment. You can call it trolling or whatever but I think that result is everything. Not the process.

My opinion is that it's overrated and hyped, again, corresponding to what we got as a result (products).

It depends on what you're interested in: rough FR/tonality comparison or something more.
If FR is the only thing that matters - that's Harman, not me! - then pair of JBL 308Ps is an endgame setup.
Or maybe it's a bit more difficult, not sure.

Harman does not think frequency response is the only thing that matters.
Where did you get that idea?
In fact a huge amount of Harman info that is cutting edge pertains to directivity and dispersion. Not to mention the obvious attention in their products to dynamics and often SOTA lack of compression.
The Harman score is older and only part of the research.
I think the fact it computes a number is why so many focus on it to much. Yah a lot of those cats are a bit Dunning-Kruger, and need to go deeper.

PeteL · Aug 17, 2021

CtheArgie said:
Actually, in medicine there are lots of examples when studies with small samples have reported results that have not been confirmed in larger ones.

For example, the remission rate in Crohn's disease with infliximab in the study by Targan was done with 28 patients per arm. The larger studies coming after were never ever able to even get close to the remission rate of this one.

To this day, most physicians believe and quote the Targan study despite this fact. Infliximab was even originally approved by the FDA based on this study

That is correct, but. Remember we are not saving lives here, those blind test experiments will always be anecdotal, but it's the sum of em all that can show paterns. Budgets for large scales studies won't happen. It's simply not important enough. Audiophilia is a hobby, we have to remind ourselves that... Now yes there is an exception, and it appears to be Harman and Olive's work. That's the data we have, anything else cannot be taken to reach rigourous scientific conclusions. Now even Harman's, it could also be argued that it's not rigorous enough, and it's a fair point that the one conducting study is the same that sells us stock. That's right. But it's the only real data we have.

CtheArgie · Aug 17, 2021

@PeteL, this area has been a bit of a puzzle to me. I took in graduate school Advanced Market Research with Paul Greene, who invented most of the modern research techniques including conjoint analysis and other "preference" tools. I also found fascinating that Kahnemann's work on utilities seems to parallel Greene's work on conjoint. The course was very heavy on stats. But digressing. Toole works involves "preference" by trained listeners and makes a few assumptions. They may be correct, but there also could be some variables. As he states, the preference score is not completely predictable. That 76% (I am speaking from memory and apologize if number are not correct) could explain that the balance is what we "can't explain".

We have to admit that from "accuracy" in an anechoic chamber to "preference" in a simulated room could have variances that we don't completely understand. Or that may require more research. And as you say, no manufacturer will attempt to kill the goose by doing more work.

preload · Aug 17, 2021

CtheArgie said:
That 76% (I am speaking from memory and apologize if number are not correct) could explain that the balance is what we "can't explain".

We have to admit that from "accuracy" in an anechoic chamber to "preference" in a simulated room could have variances that we don't completely understand. Or that may require more research. And as you say, no manufacturer will attempt to kill the goose by doing more work.

This. A very hard concept for some engineer/tech folks to understand, I'm noticing (sorry guys).

Our solved understanding of how measurements predict perceived sound quality in solid state devices like amps and DACs is far from solved when it comes to transducers.

amirm · Aug 17, 2021

CtheArgie said:
Toole works involves "preference" by trained listeners and makes a few assumptions.

This is not the case. Huge number of untrained listeners have been used in countless tests. Have you not watched my latest video?

MerlinGS · Aug 17, 2021

preload said:
Our solved understanding of how measurements predict perceived sound quality in solid state devices like amps and DACs is far from solved when it comes to transducers.

I would suggest they are not measuring the same thing. Research into DACs and other electronics and related matters (e.g. cables), tries to determine what level of distortions and nonlinearities humans can generally hear/identify in a reliable and blind fashion. Basically they try to determine what level of measured performance is required for DUTs to be transparent to human hearing. An analog of speaker tests for electronics would have been to compare SS amps that are transparent with tube amps that can be described as euphonic. Under many circumstances, many individuals would state a preference for the sonic qualities of the tube amp. However, the question then would be whether electronics and cables (using filter boxes) are the right place for distortions/euphonics to be introduced in sonic reproduction (as opposed to being introduced in a controlled fashion through digital means).

Conversely, speakers are far from transparent. Speakers at the moment are an exercise in compromise (and this is without considering the effects of rooms in sound reproduction), thus testing for preferences attempts to identify what forms of engineering compromise are significant, and what less meaningful. Because we are dealing with reproduction that is far from transparent, and with humans that have different auditory capacities, preferences, and experiences, it is highly unlikely one can come up with any testing methodology that is as predictive as that used to measure transparency with electronics.

One last point, stereo reproduction, even in the context of great recordings, cannot come close to reproducing the recording event in a transparent fashion. MCh reproduction has advanced the SOTA in reproduction immensely (when recorded and reproduced properly), but there is still much work to be done to be able to reproduce events in a "transparent" fashion (assuming that is a goal).

PS If we consider rooms, room treatments, and MCh, the answers to preference as regards speakers will be invariably affected.

CtheArgie · Aug 17, 2021

amirm said:
This is not the case. Huge number of untrained listeners have been used in countless tests. Have you not watched my latest video?

Yes I have. And the Harman curve is based on trained listeners as you know. He showed the different curves For different types of listeners. And it still is a preference.

Chromatischism · Aug 17, 2021

ROOSKIE said:
Regarding preferences.
Dynamics and scale are very important to me.
I am not a bass head per say, that said bass depth and weight contribute a lot to my/many folks perception of dynamics, scale, excitement, power, room filling soundstage, realism - things I value very much.
I also do a lot of electronic music and music with bass guitars and drums.
Especially on familiar tracks, it is difficult to enjoy them as much when known moments of exciting impact are removed or drastically reduced.
( exaggerated or bloated bass is no good for me either)

The other top speaker value for me is long term listening. This is not possible to test quickly. I couldn't care less how a speaker sounds for a few tracks, I MUST be able to enjoy the sound at reasonably loud volumes for as long as I wish. There are speakers that fall off with this requirement that sound quite good initially. (Many speakers fall off here such as KEF's Q150 - something oddly tiring in the highs for me)
It also has to still sound exciting AND have that long term listening covered. (ELAC's DBR62 fell out here for me, so boring as have been several speakers I have tried)

Finally port noise kills the deal. I can not stand any port chuffing even if rare I will hear it. Any consistent port noise sends a speaker out the door.

Agreed on all points, my system is designed around the same goals. I would agree with the DBR62 comment but only if I didn't use Dynamic EQ, which gives them what they need. In that regard I don't find them lacking much. It is in fact this kind of neutral speaker that does well with loudness compensation.

BYRTT · Aug 18, 2021

Hi @MatthewS and 1000 thanks share your hard work over here

...

Should you happen like below overlay of estimated in room/PIR curves you very welcome rightclick and save the gif file and use it for post 1.

Gatordaddy · Aug 18, 2021

CtheArgie said:
Actually, in medicine there are lots of examples when studies with small samples have reported results that have not been confirmed in larger ones.

For example, the remission rate in Crohn's disease with infliximab in the study by Targan was done with 28 patients per arm. The larger studies coming after were never ever able to even get close to the remission rate of this one.

To this day, most physicians believe and quote the Targan study despite this fact. Infliximab was even originally approved by the FDA based on this study

Outcomes like this undoubtedly occur all the time. Isn't statistical significance an economic metric that statistical significance is strong enough that the outcome likely matches the thesis? Of course, even tighter criteria for statistical significance still wouldn't weed out false correlations because of methodological or statistical flaws.

Considering that a pretty modest experiment supports established science the optimist in me would rather call on more blinded experiments rather than litigating significance thresholds.

richard12511 · Aug 18, 2021

MatthewS said:
After finishing Floyd Toole’s The Acoustics and Psychoacoustics of Loudspeakers and Rooms, I became interested in conducting a blind listening test. With the help of a similarly minded friend, we put together a test and ran 12 people through it. I’ll describe our procedures and the results. I’m aware of numerous limitations; we did our best with the space and technology we had.

All of the speakers in the test have spinorama data and the electronics have been measured on ASR.

Speakers (preference score in parentheses):

Kef Q100 (5.1)
Revel W553L (5.1)
JBL Control X (2.1)
OSD AP650 (1.1)

Test Tracks:
Fast Car – Tracy Chapman
Just a Little Lovin – Shelby Lynne
Tin Pan Alley – Stevie Ray Vaughan
Morph the Cat – Donald Fagen
Hunter – Björk

Amplifier:
2x 3e Audio SY-DAP1002 (with upgraded opamps)

DAC:
2x Motu M2

Soundboard / Playback Software:
Rogue Amoeba Farrago

The test tracks were all selected from Harman’s list of recommended tracks except for Hunter. All tracks were down mixed to mono as the test was conducted with single speakers. The speakers were set up on a table in a small cluster. We used pink noise and Room EQ and a MiniDSP UMik-1 to volume match the speakers to less than 0.5db variance. The speakers were hidden behind black speaker cloth before anyone arrived. We connected both M2 interfaces to a MacBook Pro, and used virtual interfaces to route output to the four channels. Each track was configured in Farrago to point to a randomly assigned speaker. This allowed us to click a single button on any track and hear it out of one of the speakers. We could easily jump around and allow participants to compare any speaker back to back on any track.

Our listeners were untrained, though we had a few musicians in the room and two folks that have read Toole’s book and spend a lot of time doing critical listening.

Participants were asked to rate each track and speaker combination on a scale from 0-10 where 10 represented the highest audio fidelity.

Here is a photo of what the setup looked like after we unblinded and presented results back to the participants:
View attachment 147701

Here are the results of everyone that participated.

Average rating across all songs and participants:
Revel W553L: 6.6
KEF Q100: 6.2
JBL Control X: 5.4
OSD 650: 5.2

Plotted:
View attachment 147692

You can see that the Kef and Revel were preferred and that the JBL and OSD scored worse. The JBL really lacked bass and this is likely why it had such low scores. The OSD has a number of problems that can be seen on the spin data. That said, at least two participants generally preferred it.

Some interesting observations:

Hunter and Morph the Cat were the most revealing songs for these speakers and Fast Car was close behind. Tin Pan Alley was more challenging, and Just a Little Lovin was the most challenging. The research that indicates bass accounts for 30% of our preference was on display here. Speakers with poor bass had low scores.

Here is the distribution of scores by song:

Fast Car:
View attachment 147689

Just a Little Lovin:
View attachment 147694

Tin Pan Alley:
View attachment 147695

Morph the Cat:
View attachment 147696

Hunter:
View attachment 147697

If we limit the data to the three most “trained” listeners (musicians, owners of studio monitors, etc) this is the result:
View attachment 147698

I’ve included the spin images here for everyone’s reference. ASR is the source for all but the Revel which comes directly from Harman. It says W552L, but the only change to the W553L was the mounting system.

View attachment 147699
View attachment 147700

My biggest takeaways were:

What the research to date indicates is correct; spinorama data correlates well with listener preferences even with untrained listeners.

Those small deviations, and even some big deviations on frequency response are not easy to pick out in a blind comparison of multiple speakers depending on the material being played. I have no doubt a trained listener can pick stuff out better—but the differences are subtle and in the case of the Kef and Revel, it really takes a lot of data before a pattern emerges.

The more trained listeners very readily picked out the OSD as sounding terrible on most tracks, but picking a winner between the Kef and Revel was much harder. It is interesting to note that the Kef and Revel both have a preference score of 5.1. The JBL has a 2.1 and the OSD has a 1.1

One final note, all but one listener was between 35-47 years old and it is unlikely any of us could hear much past 15 or 16khz. One participant was in their mid 20s.

EDIT:

Raw data is here.

@Semla conducted some statistical analysis here, and here.

@Semla gave permission to reproduce this chart from one of those posts. Please read the above posts for detail and background on this chart.

"You can compare two speakers by checking for overlap between the red arrows.

If the arrows overlap: no significant difference between the rating given by the listeners.

If they don't overlap: statistically significant difference between the rating given by the listeners."

View attachment 147884

Wow. This is amazing work! This is the kinda content we need!

amirm · Aug 18, 2021

CtheArgie said:
Yes I have. And the Harman curve is based on trained listeners as you know. He showed the different curves For different types of listeners. And it still is a preference.

Preference is involved in the overall target curve. This has nothing to do with flatness of the response of a device. You want a speaker that is flat on axis and smooth off-axis anechoically. You then use your room gain+EQ to lay out what you want the overall response from bass to treble to be in your deployment. As I explained in the video, there are no standards of production so you have no choice but to make up something here.

This has nothing to do with the comment you made that preferred response for a speaker is made out of what trained listeners wanted. Given a constant target curve, trained and untrained listeners like similar speakers.

amirm · Aug 18, 2021

CtheArgie said:
We have to admit that from "accuracy" in an anechoic chamber to "preference" in a simulated room could have variances that we don't completely understand.

This has been the topic of countless research projects and peer reviewed papers. The correlation between key aspects of objective anechoic measurements and listener preference is strong. Ignore it at your own peril just as you would some medication because it did not have 100% uniform efficacy for the entire population!

preload · Aug 18, 2021

CtheArgie said:
Actually, in medicine there are lots of examples when studies with small samples have reported results that have not been confirmed in larger ones.

Yep. The term is "medical reversal." And in medicine there are also lots of examples where the results of smaller studies ARE subsequently confirmed by larger ones. The hesitation to act and change practice based on results of the smaller studies, can in some cases, deprive patients of the benefits of an otherwise effective therapy. And even once definitive trials are published, it can often take years before it translates into common medical practice (17 years is the commonly cited figure). So let's not automatically dismiss the results of this listening test on the basis of its sample size alone - we all know that it isn't perfect, but as someone else pointed out, this is a hobby, not life and death.

uwotm8 · Aug 18, 2021

ROOSKIE said:
huge amount of Harman info that is cutting edge pertains to directivity and dispersion

But this is all about FR and its stability control

in variety of environments.

Four Speaker Blind Listening Test Results (Kef, JBL, Revel, OSD)

Major Contributor

Major Contributor

Major Contributor

Addicted to Fun and Learning

Senior Member

Major Contributor

Major Contributor

Addicted to Fun and Learning

Major Contributor

Founder/Admin

Active Member

Addicted to Fun and Learning

Major Contributor

Addicted to Fun and Learning

Active Member

Major Contributor

Founder/Admin

Founder/Admin

Major Contributor

Senior Member

Similar threads