
Four Speaker Blind Listening Test Results (Kef, JBL, Revel, OSD)

MatthewS

Member
Forum Donor
Joined
Jul 31, 2020
Messages
95
Likes
864
Location
Greater Seattle
After finishing @Floyd Toole's The Acoustics and Psychoacoustics of Loudspeakers and Rooms, I became interested in conducting a blind listening test. With the help of a similarly minded friend, we put together a test and ran 12 people through it. I'll describe our procedures and the results. I'm aware of numerous limitations; we did our best with the space and technology we had.

All of the speakers in the test have spinorama data and the electronics have been measured on ASR.

Speakers (preference score in parentheses):

Kef Q100 (5.1)
Revel W553L (5.1)
JBL Control X (2.1)
OSD AP650 (1.1)

Test Tracks:
Fast Car – Tracy Chapman
Just a Little Lovin – Shelby Lynne
Tin Pan Alley – Stevie Ray Vaughan
Morph the Cat – Donald Fagen
Hunter – Björk

Amplifier:
2x 3e Audio SY-DAP1002 (with upgraded opamps)

DAC:
2x Motu M2

Soundboard / Playback Software:
Rogue Amoeba Farrago

The test tracks were all selected from Harman's list of recommended tracks, except for Hunter. All tracks were downmixed to mono, as the test was conducted with single speakers. The speakers were set up on a table in a small cluster. We used pink noise, Room EQ, and a miniDSP UMIK-1 to level match the speakers to within 0.5 dB. The speakers were hidden behind black speaker cloth before anyone arrived. We connected both M2 interfaces to a MacBook Pro and used virtual interfaces to route output to the four channels. Each track was configured in Farrago to point to a randomly assigned speaker. This allowed us to click a single button on any track and hear it from one of the speakers. We could easily jump around and let participants compare any speaker back to back on any track.
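For anyone who wants to replicate the setup, here is a rough sketch of how the mono downmix and the blinded channel assignment could be scripted. This is an illustration, not the exact tooling we used (Farrago handled playback and routing), and the file names and channel numbers are placeholders:

```python
# Illustration only: downmix the test tracks to mono and blind the
# speaker-to-channel mapping. Assumes the soundfile package is installed.
# File names and channel numbers are placeholders, not our actual files.
import random
import soundfile as sf

TRACKS = ["fast_car.wav", "just_a_little_lovin.wav", "tin_pan_alley.wav",
          "morph_the_cat.wav", "hunter.wav"]
SPEAKERS = ["KEF Q100", "Revel W553L", "JBL Control X", "OSD AP650"]

# 1) Downmix each track to mono by averaging its channels.
for path in TRACKS:
    data, rate = sf.read(path)
    if data.ndim == 2:                        # stereo (or multichannel) source
        data = data.mean(axis=1)              # simple average downmix
    sf.write(path.replace(".wav", "_mono.wav"), data, rate)

# 2) Blind the speakers: shuffle which output channel feeds which speaker, so the
#    person running playback doesn't know which channel is which model.
channels = [1, 2, 3, 4]
random.shuffle(channels)
assignment = dict(zip(SPEAKERS, channels))
for speaker, channel in sorted(assignment.items(), key=lambda kv: kv[1]):
    print(f"Output channel {channel} -> {speaker}")   # keep sealed until unblinding
```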

Our listeners were untrained, though we had a few musicians in the room and two folks that have read Toole’s book and spend a lot of time doing critical listening.

Participants were asked to rate each track and speaker combination on a scale from 0-10 where 10 represented the highest audio fidelity.

Here is a photo of what the setup looked like after we unblinded and presented results back to the participants:
IMG_2589.jpg


Here are the results of everyone that participated.

Average rating across all songs and participants:
Revel W553L: 6.6
KEF Q100: 6.2
JBL Control X: 5.4
OSD AP650: 5.2


Plotted:
allresults.png


You can see that the Kef and Revel were preferred and that the JBL and OSD scored worse. The JBL really lacked bass and this is likely why it had such low scores. The OSD has a number of problems that can be seen on the spin data. That said, at least two participants generally preferred it.

Some interesting observations:

Hunter and Morph the Cat were the most revealing songs for these speakers, and Fast Car was close behind. Tin Pan Alley was more challenging, and Just a Little Lovin was the most challenging. The research indicating that bass accounts for about 30% of our preference was on display here: speakers with poor bass got low scores.

Here is the distribution of scores by song:

Fast Car:
fastcar.png


Just a Little Lovin:
lovin.png


Tin Pan Alley:
tinpan.png


Morph the Cat:
morph.png


Hunter:
hunter.png


If we limit the data to the three most “trained” listeners (musicians, owners of studio monitors, etc) this is the result:
trained.png



I've included the spin images here for everyone's reference. ASR is the source for all but the Revel, which comes directly from Harman. The chart is labeled W552L, but the only change in the W553L was the mounting system.

allspins.png

Estimated In-Room Response copy.png


My biggest takeaways were:

What the research to date indicates is correct; spinorama data correlates well with listener preferences even with untrained listeners.

Small deviations, and even some big deviations, in frequency response are not easy to pick out in a blind comparison of multiple speakers, depending on the material being played. I have no doubt a trained listener can pick things out better, but the differences are subtle, and in the case of the KEF and Revel it really takes a lot of data before a pattern emerges.

The more trained listeners very readily picked out the OSD as sounding terrible on most tracks, but picking a winner between the KEF and Revel was much harder. It is interesting to note that the KEF and Revel both have a preference score of 5.1, while the JBL has a 2.1 and the OSD a 1.1.
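For anyone curious where those scores come from, they are computed from the spinorama metrics. Here is a minimal sketch of the commonly cited Olive (2004) preference-rating model as I understand it; the metric values in the example are invented placeholders, not measurements of any of these speakers:

```python
# Sketch of the Olive (2004) preference-rating model as commonly cited
# (the same family of model behind the 5.1 / 5.1 / 2.1 / 1.1 scores above).
# NBD = narrow-band deviation (on-axis and predicted in-room), LFX = log10 of the
# low-frequency extension, SM = smoothness of the predicted in-room response.
def olive_preference(nbd_on, nbd_pir, lfx, sm_pir):
    return 12.69 - 2.49 * nbd_on - 2.99 * nbd_pir - 4.31 * lfx + 2.32 * sm_pir

# Invented placeholder inputs, chosen only to show a score in a familiar range:
print(round(olive_preference(nbd_on=0.4, nbd_pir=0.3, lfx=1.8, sm_pir=0.9), 1))  # ~5.1
```

The heavy LFX (bass extension) term in that model lines up with what we heard: the two speakers with limited bass landed at the bottom.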

One final note: all but one listener was between 35 and 47 years old, and it is unlikely any of us could hear much past 15 or 16 kHz. One participant was in their mid-20s.

EDIT:

Raw data is here.

@Semla conducted some statistical analysis here, and here.

@Semla gave permission to reproduce this chart from one of those posts. Please read the above posts for detail and background on this chart.

"You can compare two speakers by checking for overlap between the red arrows.
  • If the arrows overlap: no significant difference between the rating given by the listeners.
  • If they don't overlap: statistically significant difference between the rating given by the listeners."
1629130223042.png
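This is not @Semla's method; it is just a minimal sketch of a simpler sanity check anyone could run on the raw data: a mean and 95% confidence interval per speaker. It assumes the data is a CSV with columns named speaker and rating, which are my assumptions, not the actual file layout:

```python
# Simpler alternative check, not @Semla's analysis: mean rating and 95% CI per
# speaker. The CSV file name and column names ("speaker", "rating") are assumptions.
import pandas as pd
from scipy import stats

ratings = pd.read_csv("blind_test_ratings.csv")

for speaker, grp in ratings.groupby("speaker"):
    mean = grp["rating"].mean()
    sem = stats.sem(grp["rating"])                      # standard error of the mean
    lo, hi = stats.t.interval(0.95, len(grp) - 1, loc=mean, scale=sem)
    print(f"{speaker}: mean {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Roughly speaking, two speakers whose intervals overlap heavily are not distinguishable with this amount of data.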
 
How much of a role did that champagne coupe in front of the KEF play???

It was mostly involved in the unblinding and resulting discussion of results. I realized after the entire thing was done that I didn't really take enough photographs during the process. That ended up being the only photo I had.
 
We used pink noise, Room EQ, and a miniDSP UMIK-1 to level match the speakers to within 0.5 dB.
I'm interested in a little more detail about this, please.

Was the volume/SPL that you measured and matched A-weighted or unweighted?

Were the speakers really in different locations while being auditioned, as shown in the photo?

cheers and thanks for posting your report.
 
We used C weighting for the meter.

Yes, the way the speakers are arranged in the photo is how they were positioned behind the blind material. It would have been too challenging to rearrange them between each track we were testing.
 
We used C weighting for the meter.

I'm curious as to whether A weighting wouldn't be a better choice. After all, if Speaker A had a bit more bass than Speaker B, using C weighting would mean the sensitive 1-5 kHz range will probably be quieter for Speaker A during the listening tests. Wouldn't that (in general, if not every single time) lead to a preference for Speaker B, by dint of being set to play louder in the ear's sensitive band?

Applying my theory to the in-room responses you show, I would predict a preference order of KEF first, then Revel (close), then JBL, then OSD (with its suppressed 1-3 kHz).

That holds pretty close to your listening test result. Which means the preference order might have been due to the use of C weighting instead of A weighting for the level matching.

Interesting?
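To make the difference concrete, here is a minimal sketch (my own illustration, using the standard IEC 61672 weighting formulas) showing how much each weighting attenuates a few frequencies; C barely discounts bass, while A largely ignores it:

```python
# Illustration only: IEC 61672 A- and C-weighting in dB (both normalized to 0 dB
# at 1 kHz). A bass-heavy speaker reads hotter on a C-weighted meter, so level
# matching with C weighting turns its midrange down relative to a lean speaker.
import math

def a_weight_db(f):
    ra = (12194.0**2 * f**4) / ((f**2 + 20.6**2)
         * math.sqrt((f**2 + 107.7**2) * (f**2 + 737.9**2))
         * (f**2 + 12194.0**2))
    return 20 * math.log10(ra) + 2.00

def c_weight_db(f):
    rc = (12194.0**2 * f**2) / ((f**2 + 20.6**2) * (f**2 + 12194.0**2))
    return 20 * math.log10(rc) + 0.06

for f in (50, 100, 1000, 3000):
    print(f"{f:>5} Hz   A: {a_weight_db(f):+6.1f} dB   C: {c_weight_db(f):+6.1f} dB")
```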
 
Much congratulations for your great efforts and really interesting result!

If only as encouragement, and as a very valuable and interesting observation for me: the estimated in-room response shapes and slopes of the highly rated KEF Q100 and Revel W553L are similar to the latest, best-tuned in-room (at the listening position) response shape of my multichannel multi-way system, which I shared recently in my posts here and here, except for a slight upward slope above roughly 6 kHz in my case, which I intentionally configured to compensate for my slight age-related hearing decline in the high range... Of course, I understand well that the frequency response at the listening position is only one of the many factors contributing to the total sound quality.
 
No orchestral or chamber music?

We debated it, but ended up leaving it out because we wanted to keep the number of tracks folks were listening to at a reasonable level. Most of the participants, while nerdy in many ways, were not going to want to rate too many tracks. It may have been a mistake to leave it out, but we felt these were a representative sample. If we were going to do it again, we would likely change the track list: we'd probably drop Tin Pan Alley and/or Just a Little Lovin and include something classical, plus Bird on a Wire. My friend and I were just sick of listening to Bird on a Wire and decided to exclude it. We felt we had to keep Fast Car in, even though that song has bored a hole in my brain. It's a great song, but I've listened to it too many times for it to be enjoyable at this point.
 
I'm curious as to whether A weighting wouldn't be a better choice. After all, if Speaker A had a bit more bass than Speaker B, using C weighting will mean the sensitive range 1-5k will probably be quieter for Speaker A during the listening tests. Wouldn't that (in general, if not every single time) lead to a preference for Speaker B, by dint of being set to play louder in the ear's sensitive band?

It's interesting and it's something we debated as we set it up. We ended up going with C because it covered most of the range. I can state that subjectively, during the listening test I couldn't perceive a difference in volume between the speakers. They all sounded different, but I wouldn't have pegged any as sounding louder.

If you look at the rankings of the "trained" listeners, it really isn't even close. I participated (it was blind for me; the speakers were randomized by an RNG by someone else who didn't know which speaker was which behind the screen), and the OSD really stood out on most tracks as sounding wrong and bad. The JBL lacked bass on most tracks. The Revel and the KEF were very close; some listeners preferred the KEF and some the Revel, and it varied by track.

Ultimately, I don't think we have enough data or precision that any subtle volume differences at different frequencies significantly influenced the result. Basically, I think the volume matching was good enough that any differences are lost in the other "noise" of the test.

If we did it again (some musician friends of mine already asked me to run it again), I'd probably try to reach out to Dr. Toole and see which weighting they use at Harman. Maybe someone on the forum knows.
 
At least just for encouragement and very valuable/interesting observation for myself, the estimated in-room response shapes and slopes for highly rated KEF Q100 and Revel W553L are similar to the latest best tuned in-room (at listening position) response shape of my multichannel multi-way system

It does align with the research showing that an in-room response that looks like that tends to be preferred.

Here is a diagram from Toole's book that is also publicly available via this AES paper:

https://www.aes.org/e-lib/browse.cfm?elib=17839

I hope it's OK to reproduce it here since it is public:

Screen Shot 2021-08-15 at 8.18.42 PM.png


There is a good thread here: https://www.audiosciencereview.com/...ut-room-curve-targets-room-eq-and-more.10950/
 
Thanks very much for this. This is perhaps the most realistic way to test, if the most "work intensive".
What IMHO would be even more work intensive, but nonetheless interesting: EQ the speakers as well as possible (frequency response, not just level) and re-test, to see how much audible difference remains.
 