
Blind Listening Test 2: Neumann KH 80 vs JBL 305p MkII vs Edifier R1280T vs RCF Arya Pro5

vole-boy

Member
Forum Donor
Joined
Feb 22, 2020
Messages
23
Likes
67
Lovely to see an attempt at a decent experimental setup. Just a note on the statistics you used - you used ANOVA and I'm not sure that's 100% correct - it'll approximate the correct result but wouldn't stand up to peer review (at least not in my discipline). That's because ANOVAs are parametric tests that make assumptions about the underlying data that aren't actually supported when you use a Likert-type preference scale. It appears that the speakers were rated on a 1-10 scale, so the units are ordinal (they are ordered, but the sizes of the differences between the numbers might vary - for example, 2 is larger than 1 and 3 is larger than 2, BUT 2 is not necessarily twice the size of 1, and 3 is not necessarily three times larger than 1, etc.). The correct statistic in this case is an ordinal logistic regression (you can use the package ordinal by Christensen in R for this: https://cran.r-project.org/web/packages/ordinal/ordinal.pdf). These can be performed as repeated-measures analyses if needed. I work a LOT with survey data in my day job, and this is the correct statistical method for data derived from preference scores. I'll stop being a smart a*se now - thanks again for all the reviewing. Much appreciated!
 

vole-boy

Member
Forum Donor
Joined
Feb 22, 2020
Messages
23
Likes
67
I'm happy to supply some code, or do a bit of analysis if helpful!
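In the same spirit as vole-boy's suggestion (his recommendation is R's `ordinal` package; nothing below is from that package), a lighter-weight rank-based option is the Friedman test, the nonparametric analogue of a repeated-measures ANOVA: it uses only each listener's within-listener ranking of the speakers, so the spacing of the 1-10 scale never matters. A minimal stdlib-only sketch with made-up ratings (the thread's raw data isn't posted):

```python
def friedman_statistic(ratings):
    """Friedman chi-square for repeated measures on ordinal data.

    ratings[i][j] = listener i's score for speaker j.
    Scores are converted to ranks within each listener (average ranks
    for ties), so only the ordering matters, not the spacing.
    Compare the result against a chi-square with k-1 degrees of freedom.
    """
    n = len(ratings)      # listeners
    k = len(ratings[0])   # speakers
    rank_sums = [0.0] * k
    for row in ratings:
        order = sorted(range(k), key=lambda j: row[j])  # ascending by score
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            # extend j over a run of tied scores
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average of 1-based positions i..j
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    return (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Hypothetical ratings: 5 listeners x 4 speakers on a 1-10 scale
ratings = [
    [6, 5, 3, 4],
    [7, 6, 4, 5],
    [5, 6, 3, 4],
    [6, 4, 2, 5],
    [7, 5, 3, 4],
]
print(round(friedman_statistic(ratings), 2))  # 12.84, above the 7.81 cutoff for df=3 at p=.05
```

For publication-grade analysis the ordinal regression vole-boy describes is still the better tool; this is just the quickest sanity check that respects the ordinal scale.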
 

DJBonoBobo

Major Contributor
Forum Donor
Joined
Jan 21, 2020
Messages
1,391
Likes
2,919
Location
any germ
Very interesting, thanks!

The room is a limiting factor, though. My experience is that my own KH310s sound really bad without room treatment, but the more treatment I use, the more even subtle nuances become important.
You can see that room influences dominate the FR of every speaker.
So I wonder whether the results would have been clearer and the differences between speakers more pronounced in a better-treated room? Still an interesting experiment, of course, because all speakers had the same conditions.
 

jae

Major Contributor
Joined
Dec 2, 2019
Messages
1,209
Likes
1,514
2. High-pass each speaker (say at 80 Hz) to eliminate the variable bass output.
Thought this as well. Would also be interested in the average age of participants in each trial, and/or the average age for participants who chose each speaker as their #1.
 

computer-audiophile

Major Contributor
Joined
Dec 12, 2022
Messages
2,565
Likes
2,884
Location
Germany
In the direct near field, the influence of the room is relatively small. I compared my KH120 and the JBL 305p MkII in the near field, that's what they are both made for.
You could also bring in some women. (My wife, who is a trained audiophile, also finds the JBL rather better than the Neumann).
 

Omid

Member
Joined
Nov 8, 2019
Messages
22
Likes
14
You may have already discussed your methodology elsewhere, so sorry if I repeat something you’ve already considered.

For the next test may I suggest having each listener rate the same speaker on 3 different occasions, in blinded fashion? So if you test 5 speakers, you’d run the test 15 times randomly choosing speakers so they each get 3 turns (but the tester isn’t aware whether he has already rated a given speaker).

It makes the test even more tedious, but it allows you to look for internal consistency. If the same listener assigns scores that vary from 4 to 6 for the same speaker, it would put the validity of the test in question. If each tester consistently scores the same speaker at approximately the same value, you can feel more confident in the results.

Perhaps this is too impractical…
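The consistency check proposed above is cheap to automate once the repeated ratings are collected. A minimal sketch (listener names, speaker labels, and the 1-point threshold are all illustrative, not from the thread):

```python
def consistency_report(repeats, max_spread=1):
    """Flag listeners whose repeated blind ratings of the same speaker
    spread more than `max_spread` points.

    repeats: {listener: {speaker: [ratings from repeated blind presentations]}}
    Returns {listener: [(speaker, spread), ...]} for listeners exceeding
    the threshold on at least one speaker.
    """
    flagged = {}
    for listener, by_speaker in repeats.items():
        bad = []
        for speaker, scores in by_speaker.items():
            spread = max(scores) - min(scores)
            if spread > max_spread:
                bad.append((speaker, spread))
        if bad:
            flagged[listener] = bad
    return flagged

# Hypothetical data: each speaker rated on three blind occasions
repeats = {
    "listener_1": {"KH80": [6, 6, 5], "305p": [7, 7, 7]},
    "listener_2": {"KH80": [4, 6, 8], "305p": [5, 5, 6]},  # 4-point spread on KH80
}
print(consistency_report(repeats))  # flags listener_2 on the KH80
```

Ratings from flagged listeners could then be down-weighted or excluded before the main analysis, as the post suggests.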
 

JohnBooty

Addicted to Fun and Learning
Forum Donor
Joined
Jul 24, 2018
Messages
637
Likes
1,595
Location
Philadelphia area
The JBL 305/306 basically snapped me out of this hobby.

Sometimes the treble sounds a little rough to me, maybe, but they are just so correct for so little money. I have a small and weirdly shaped room with a lack of ideal seating positions so their polite off-axis behavior makes them the clear winners in that room for me.

Crossed over to a sub or two, you have an Extremely Correct™ full-range system for well under $1000 and possibly under $500 if you chase sales and don't need the lowest octave.

Listened to some great systems at an audio show in 2019, expecting to rack up a case of upgrade-itis and a big credit card bill. Some systems pulled off impressive tricks that surpassed the JBLs in various ways. But my conclusion was that I'd need a bigger and better listening room to really reap the benefits and that even in a "better" room, the JBLs would still sound better off-axis.

If I ever have a surplus of money I'll think about replacing them with Genelecs that have similarly pristine off-axis behavior. But that would probably be the only contender for me.
 

nowonas

Member
Joined
Jan 24, 2017
Messages
22
Likes
28
Thank you for sharing this, and for the effort of producing a good test!

A couple of things I found interesting:
- Speakers were difficult to separate from each other
- Even with a fairly good frequency response, "they all sound terrible" (comment at the end)

I know that there are many factors (like the room) that can contribute to the bad sound. But personally, I would find it very interesting to redo the same test with speakers that are not so similar to each other (these were all small and ported, with relatively good frequency response). It would, for instance, be very interesting to see how an active, DSP-controlled, closed-box speaker with less group delay and, say, a perfect step response would compare to the more "traditional" speakers in this test.

This is also a fairly common critique of Floyd Toole's research, which likewise used very "similar" speakers when developing his preference scale. To me it would be interesting to see someone address that critique and add more modern speaker designs (DSP-controlled and closed) into the mix, designs which were not available or common when Floyd Toole did his research.
 

Koeitje

Major Contributor
Joined
Oct 10, 2019
Messages
2,309
Likes
3,976
I know what you mean. And it is indeed the case that the JBL produces better bass, in my impression. I see it as an advantage, as I tried to describe in my older post. For example, piano sounds much more realistic, with more 'body', to my ears. After this test here, I like my little JBLs even more. :);)
Low-end extension is very important. So much so that I think that 5" is the absolute minimum for any speaker you want to use without a subwoofer. Below that you simply don't have the extension and SPL.
 

DJBonoBobo

Major Contributor
Forum Donor
Joined
Jan 21, 2020
Messages
1,391
Likes
2,919
Location
any germ
You can see the influence of the room in the measurements, for example the peak around 300 Hz.
 

Dennis_FL

Addicted to Fun and Learning
Forum Donor
Joined
Feb 21, 2020
Messages
535
Likes
424
Location
Venice, FL
I wonder if we will eventually give ChatGPT a voice?
 

Sokel

Master Contributor
Joined
Sep 8, 2021
Messages
6,234
Likes
6,361
The KH80 was in the test too. It has DSP and everything, but the room demolishes it as well.
The interesting thing would be the same speaker with room correction applied.
 

Eetu

Addicted to Fun and Learning
Forum Donor
Joined
Mar 11, 2020
Messages
763
Likes
1,182
Location
Helsinki
Good work! Can't wait for you to do more tests :p
an active DSP controlled, closed box speaker
Any examples? D&D, Kii, Buchardt A500 come to mind but not only are they a lot bigger but also significantly more expensive.

The Neumann KH80 is an active DSP design btw.
 

charleski

Major Contributor
Joined
Dec 15, 2019
Messages
1,098
Likes
2,240
Location
Manchester UK
High-pass all speakers at say 80~100 Hz, and see how much the low end determines the result
The one thing that pops out from looking at the in-room responses is that the JBL was managing to produce a fair amount of energy at 40 Hz, whereas the others were well down by that frequency. It might be worth investigating how much of an effect that has on the preference score.

Of course that means even more work, but that's science for you: every well-performed experiment just lands you with more questions to answer.
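The high-pass idea floated in this thread can be sketched with a standard second-order (12 dB/oct) filter from the RBJ audio-EQ cookbook. This is a stdlib-only illustration, not anything from the actual test rig; cutoff and Q are the usual Butterworth defaults:

```python
import math

def highpass_biquad(samples, fs, f0=80.0, q=0.707):
    """Second-order high-pass, RBJ audio-EQ-cookbook coefficients.
    Removes content below ~f0 so speakers can be compared without
    their differing low-end extension."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    cosw = math.cos(w0)
    b0 = (1.0 + cosw) / 2.0
    b1 = -(1.0 + cosw)
    b2 = (1.0 + cosw) / 2.0
    a0 = 1.0 + alpha
    a1 = -2.0 * cosw
    a2 = 1.0 - alpha
    # Direct Form I, normalized by a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

# A 40 Hz tone (an octave below the cutoff) comes out strongly attenuated
fs = 48000
tone = [math.sin(2 * math.pi * 40 * n / fs) for n in range(fs)]
filtered = highpass_biquad(tone, fs)
print(max(abs(s) for s in filtered[fs // 2:]))  # steady-state peak, well under 1.0
```

In practice you would apply the same filter ahead of every speaker's feed, so the 40 Hz advantage noted above is taken out of the comparison.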
 

PeteL

Major Contributor
Joined
Jun 1, 2020
Messages
3,303
Likes
3,849
Thanks much, quite an effort indeed.
I have to admit that I have not watched the whole video yet, so maybe this is answered, but would it be possible to add the testing conditions? Listening distance, SPL, and room size and characteristics are the first that come to mind as relevant data.
Thanks.
 

Thomas_A

Major Contributor
Forum Donor
Joined
Jun 20, 2019
Messages
3,492
Likes
2,509
Location
Sweden
Shortly after completing the first blind listening test, @Inverse_Laplace and I started thinking about all the ways we’d like to improve the rigor and explore other questions. Written summary follows, but here is a video if you prefer that medium:

Speakers (preference score in parentheses):

Test Tracks:

  1. Fast Car – Tracy Chapman
  2. Bird on a Wire – Jennifer Warnes
  3. I Can See Clearly Now – Holly Cole
  4. Hunter – Björk
  5. Die Parade der Zinnsoldaten – Leon Jessel (Dallas Wind Symphony)

Unless noted below, we used the same equipment, controls, and procedures as last time; review that post for details.
  • Motorized turntable: 1.75s switch time between any two speakers
  • ITU-R BS.1770 loudness instead of C-weighting
  • Significantly larger listening room
  • 5 powered bookshelf/monitors (preference ratings from 2.1 to 6.2)
  • Room measurements of each speaker at multiple listening positions
By far the most significant improvement was the motorized turntable. We were able to rotate to any speaker in 1.75 seconds and keep the tweeter in the same location for each speaker. The control board also randomized the speakers for each track automatically and was controllable remotely from an iPad.
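The per-track speaker randomization described above can be sketched as drawing an independent permutation for each track (purely illustrative; the actual control-board code isn't shown in the thread, and the blinded labels are hypothetical):

```python
import random

def presentation_orders(speakers, tracks, seed=None):
    """Return an independent random speaker order for each track, so a
    listener cannot infer speaker identity from its position in the
    sequence. A fixed seed makes a session's order reproducible."""
    rng = random.Random(seed)
    return {track: rng.sample(speakers, len(speakers)) for track in tracks}

speakers = ["A", "B", "C", "D", "E"]  # blinded labels, not model names
tracks = ["Fast Car", "Bird on a Wire", "I Can See Clearly Now", "Hunter", "Parade"]
for track, order in presentation_orders(speakers, tracks, seed=1).items():
    print(track, order)
```

Each track gets every speaker exactly once, in a fresh random order.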

View attachment 275371
View attachment 275372


We only had time to conduct the listening test with a small number of people and ended up having to toss out the data from three individuals. The test was underpowered, and we did not achieve statistical significance (at the p < .05 level). That said, here are the results we collected:

View attachment 275373

Spinorama of speakers:


View attachment 275374

In-room response plotted against estimated:

View attachment 275375

Our biggest takeaways were:
  • Recruit a larger cohort
  • Schedule on a weekend
  • Well controlled experiments are hard
Some personal thoughts:

Once you get into well-behaved studio monitors, it becomes extremely difficult to tease the differences apart. It takes a lot of listening, and tracks that excite the small issues in each speaker. A preference score of 4 vs. 6 appears to be a significant difference, but depending on the nature of the flaws it can be extremely challenging to hear. It is easy to hear that the speakers sound different, but picking out the better speaker gets very difficult.

Running a well-controlled experiment is extremely difficult. We had to measure groups on different days and getting the level matching and all the bugs worked out was a challenge. We learned a lot and will apply it to our next set of tests.

Comments from the individual that ran the statistical analysis:
A repeated measures analysis of variance (ANOVA) found no significant difference in sound ratings for the 5 different speaker types, F(4, 16) = 1.68, p = .205, partial eta-squared = .295.

Paired samples t-tests were then run to compare the average sound ratings between each possible pair of speakers. For the most part, speakers showed no significant differences in sound ratings, ps > .12. However, there was a significant difference between sound ratings for the JBL versus EdifierEQ speakers, t(4) = 3.88, p = .018, such that participants reported significantly better sound ratings for the JBL speaker (M = 6.18, SE = 0.31) over the EdifierEQ speaker (M = 5.64, SE = 0.40).
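For reference, the paired-samples t statistic reported above is just the mean of the per-listener rating differences divided by its standard error. A stdlib-only sketch with hypothetical numbers (the thread's raw per-listener data isn't posted, so these values are made up):

```python
import math

def paired_t(a, b):
    """Paired-samples t statistic: each listener rates both speakers,
    so we test whether the per-listener differences average to zero.
    Returns (t, degrees_of_freedom)."""
    assert len(a) == len(b)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance of diffs
    se = math.sqrt(var / n)                              # standard error of mean diff
    return mean / se, n - 1

# Hypothetical per-listener average ratings for two speakers
jbl     = [6.4, 5.9, 6.5, 6.1, 6.0]
edifier = [5.8, 5.5, 6.0, 5.6, 5.3]
t, df = paired_t(jbl, edifier)
print(round(t, 2), df)
```

The pairing matters: because each listener rates both speakers, listener-to-listener differences in how the scale is used cancel out of the differences, which is why a paired test can find an effect that a between-subjects comparison of the same means would miss.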

An interesting observation: for one group of listeners, we had to level match the speakers again, and in our haste we used pink noise instead of the actual material. Pink noise spreads energy equally per octave, which isn't necessarily representative of the spectrum of the musical selections. The Neumann KH80 was a full 3 dB lower (ITU-R BS.1770) with the music tracks than most of the other speakers (we measured after the test, and we could clearly hear differences in the volume of each speaker). We threw out this data for our analysis, but the speaker with the lowest level was universally given awful ratings by each listener.
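The pink-noise mishap illustrates a general point: matching levels on one signal does not match them on another if the devices have different frequency responses where the second signal carries its energy. A toy stdlib-only illustration (two sine "bands" standing in for broadband signals; the 12 dB roll-off and band choices are invented, not measurements of any speaker here, and real ITU-R BS.1770 loudness additionally applies K-weighting and gating):

```python
import math

def rms_db(samples):
    """RMS level in dB relative to full scale (1.0)."""
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_sq)

fs = 48000
n = fs  # one second of signal
low  = [math.sin(2 * math.pi * 50   * i / fs) for i in range(n)]
high = [math.sin(2 * math.pi * 2000 * i / fs) for i in range(n)]

def speaker_b(low_part, high_part):
    """Toy speaker that passes highs but attenuates the 50 Hz band by 12 dB."""
    g = 10 ** (-12 / 20)
    return [g * l + h for l, h in zip(low_part, high_part)]

# Flat-spectrum "alignment noise": equal energy in both bands
noise_flat = [l + h for l, h in zip(low, high)]
noise_rolled = speaker_b(low, high)
# Bass-heavy "music": most energy in the low band
music_low = [3 * s for s in low]
music_flat = [l + h for l, h in zip(music_low, high)]
music_rolled = speaker_b(music_low, high)

print(round(rms_db(noise_flat) - rms_db(noise_rolled), 1))  # gap seen on the alignment signal
print(round(rms_db(music_flat) - rms_db(music_rolled), 1))  # larger gap on the bass-heavy program
```

Matching gains on the first signal leaves a residual level error on the second, which is exactly the effect the KH80 showed against the music tracks.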

We are looking to conduct another test with a larger group, possibly this spring.
Very nice work.

Would be fun if you had the opportunity to do a binaural recording with in-ear microphones of a test session, and link the file here.
 