
Blind Listening Test 2: Neumann KH 80 vs JBL 305p MkII vs Edifier R1280T vs RCF Arya Pro5

teashea

Shortly after completing the first blind listening test, @Inverse_Laplace and I started thinking about all the ways we’d like to improve the rigor and explore other questions. Written summary follows, but here is a video if you prefer that medium:

Speakers (calculated preference score in parentheses):

Test Tracks:

  1. Fast Car – Tracy Chapman
  2. Bird on a Wire – Jennifer Warnes
  3. I Can See Clearly Now – Holly Cole
  4. Hunter – Björk
  5. Die Parade der Zinnsoldaten – Leon Jessel (Dallas Wind Symphony)

Unless noted below, we used the same equipment, controls, and procedures as last time; review that post for details.
  • Motorized turntable: 1.75 s switch time between any two speakers
  • ITU-R BS.1770 loudness matching instead of C-weighting
  • Significantly larger listening room
  • Five powered bookshelf speakers/monitors (preference ratings from 2.1 to 6.2)
  • Room measurements of each speaker at multiple listening positions
By far the most significant improvement was the motorized turntable. We were able to rotate to any speaker in 1.75 seconds and keep the tweeter in the same location for each speaker. The control board also randomized the speakers for each track automatically and was controllable remotely from an iPad.
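For illustration, here is a minimal sketch of this kind of per-track randomization; the blinded labels, the track list as dictionary keys, and the seed are my own invention, not the actual control-board firmware.

```python
# A minimal sketch of per-track speaker randomization, assuming a
# five-position turntable. All names here are hypothetical.
import random

POSITIONS = ["A", "B", "C", "D", "E"]  # blinded speaker labels, one per position
TRACKS = [
    "Fast Car", "Bird on a Wire", "I Can See Clearly Now",
    "Hunter", "Die Parade der Zinnsoldaten",
]

rng = random.Random(42)  # fixed seed so the schedule can be reconstructed afterwards
schedule = {track: rng.sample(POSITIONS, len(POSITIONS)) for track in TRACKS}

for track, order in schedule.items():
    print(f"{track}: presentation order {' -> '.join(order)}")
```

Fixing the seed keeps the session double-blind while still letting the presentation order be recovered for analysis afterwards.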

View attachment 275371
View attachment 275372


We only had time to conduct the listening test with a small number of people and ended up having to toss out the data from three individuals. The test was underpowered, and the main result did not reach statistical significance at the p < .05 level. That said, here are the results we collected:

View attachment 275373
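For context on the power issue, a rough sample-size estimate can be derived from the effect size reported in the analysis below. This is only a sketch: it assumes statsmodels' between-subjects ANOVA power model, which approximates a repeated-measures design (correlated ratings generally need fewer subjects than this suggests).

```python
# Rough sample-size estimate from the reported effect size.
from math import sqrt

from statsmodels.stats.power import FTestAnovaPower

eta_sq = 0.295                   # partial eta-squared reported in the analysis below
f = sqrt(eta_sq / (1 - eta_sq))  # convert to Cohen's f, about 0.65 (a large effect)

n_total = FTestAnovaPower().solve_power(
    effect_size=f, k_groups=5, alpha=0.05, power=0.80
)
print(f"Cohen's f = {f:.2f}; about {n_total:.0f} total observations for 80% power")
```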

Spinorama of speakers:


View attachment 275374

In-room measurement plotted against the estimated in-room response:

View attachment 275375

Our biggest takeaways were:
  • Recruit a larger cohort
  • Schedule on a weekend
  • Well-controlled experiments are hard
Some personal thoughts:

Once you get into well-behaved studio monitors, it becomes extremely difficult to tease apart the differences. It takes a lot of listening, and tracks that excite the small issues in each speaker. A preference score of 4 versus 6 looks like a large difference, but depending on the nature of the flaws, it can be extremely challenging to hear. It is easy to hear that the speakers sound different; picking out the better speaker is much harder.

Running a well-controlled experiment is extremely difficult. We had to measure groups on different days, and getting the level matching and all the other bugs worked out was a challenge. We learned a lot and will apply it to our next set of tests.

Comments from the individual who ran the statistical analysis:
A repeated measures analysis of variance (ANOVA) found no significant difference in sound ratings for the 5 different speaker types, F(4, 16) = 1.68, p = .205, partial eta-squared = .295.
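For anyone who wants to reproduce this from the attached raw data, here is a minimal sketch of the repeated-measures ANOVA. It assumes a long-format table with one mean rating per listener per speaker (AnovaRM needs a balanced design); the file and column names are my invention.

```python
# Repeated-measures ANOVA sketch on assumed long-format data.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

df = pd.read_csv("ratings.csv")  # hypothetical columns: listener, speaker, rating
res = AnovaRM(df, depvar="rating", subject="listener", within=["speaker"]).fit()
print(res)  # with 5 listeners and 5 speakers this yields F(4, 16) for the speaker factor
```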

Paired samples t-tests were then run to compare the average sound ratings between each possible pair of speakers. For the most part, speakers showed no significant differences in sound ratings, ps > .12. However, there was a significant difference between sound ratings for the JBL versus EdifierEQ speakers, t(4) = 3.88, p = .018, such that participants reported significantly better sound ratings for the JBL speaker (M = 6.18, SE = 0.31) over the EdifierEQ speaker (M = 5.64, SE = 0.40).
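And a sketch of one of the pairwise comparisons under the same assumptions. Note that with ten possible pairs, a multiple-comparison correction (e.g. Holm) is worth considering when interpreting any single p-value.

```python
# One paired comparison on the same assumed data.
import pandas as pd
from scipy import stats

df = pd.read_csv("ratings.csv")  # hypothetical columns: listener, speaker, rating
wide = df.pivot_table(index="listener", columns="speaker", values="rating")

t, p = stats.ttest_rel(wide["JBL"], wide["EdifierEQ"])  # speaker labels assumed
print(f"t({len(wide) - 1}) = {t:.2f}, p = {p:.3f}")
```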

An interesting observation: for one group of listeners, we had to level match the speakers again, and in our haste we used pink noise instead of the actual program material. Pink noise spreads energy evenly across the spectrum, which isn't necessarily representative of the musical selections. The Neumann KH80 measured a full 3 dB lower (ITU-R BS.1770) on the music tracks than most of the other speakers (we measured after the test, and we could clearly hear differences in the volume of each speaker). We threw out this data for our analysis, but the speaker with the lowest level was universally given awful ratings by every listener.
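Here is a sketch of how that mismatch could be checked, assuming per-speaker recordings captured at the listening position; pyloudnorm and the file names are my assumptions, not the authors' actual measurement chain.

```python
# Compare level-matching gains derived from pink noise versus the program
# material itself, both via BS.1770 integrated loudness.
import pyloudnorm as pyln
import soundfile as sf

TARGET_LUFS = -23.0  # arbitrary reference level for this sketch

def integrated_lufs(path):
    data, rate = sf.read(path)
    return pyln.Meter(rate).integrated_loudness(data)  # BS.1770 integrated loudness

for spk in ["neumann_kh80", "jbl_305p", "edifier_r1280t"]:  # hypothetical file stems
    gain_pink = TARGET_LUFS - integrated_lufs(f"{spk}_pink.wav")
    gain_music = TARGET_LUFS - integrated_lufs(f"{spk}_fast_car.wav")
    # If the two gains differ, pink-noise matching leaves the speakers
    # mismatched on the actual program material.
    print(f"{spk}: pink {gain_pink:+.1f} dB vs music {gain_music:+.1f} dB")
```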

We are looking to conduct another test with a larger group, possibly this spring.

EDIT:

REW In-Room Measurements
Attached: the raw listener-preference data for anyone who wants to look at it.


Very interesting - a lot of effort and thought went into that.
 

teashea

There are lots of small issues, suggestions, and criticisms that people will have of your testing method, but that should in no way diminish what you have done in this testing. What you have done is excellent.
 

Miguelón

[Quoted: the full test write-up from the post above.]
You need a bigger sample; the preliminary results look interesting, but you're not reaching statistical significance.

Another thing: as I mentioned on your YouTube channel, the room acoustics seem to dominate the sound, especially around 200-300 Hz (table reflections?).

That homogenizes the sound of whatever speaker is playing, especially in the case of monitors. An anechoic chamber would cost a tremendous amount of money, but consider changing the placement and using floor stands.

They will not rotate, but you could take some time to blindfold the subjects and swap the speakers by hand.

Notice that the Edifier, which shows the most deviant frequency response, got the worst evaluation. Even if the null hypothesis cannot be strongly rejected by these results, it is a beginning…
 