Shortly after completing the first blind listening test, @Inverse_Laplace and I started thinking about all the ways we’d like to improve the rigor and explore other questions. Written summary follows, but here is a video if you prefer that medium:
Unless noted below, we used the same equipment, controls, and procedures as last time, review that post for details.
We only had time to conduct the listening test with a small number of people and ended up having to toss out data on three individuals. The test was underpowered. We did not achieve statistical significance (p-value < .05). That said, here are the results we collected:
Our biggest takeaways were:
Once you get into well-behaving studio monitors, it becomes extremely difficult to tease apart the differences. It takes a lot of listening and tracks that excite small issues in each speaker. A preference score of 4 vs 6 appears to be a significant difference but depending on the nature of the flaws it can be extremely challenging to hear the difference. It is easy to hear that the speakers sound different but picking out the better speaker gets very difficult.
Running a well-controlled experiment is extremely difficult. We had to measure groups on different days and getting the level matching and all the bugs worked out was a challenge. We learned a lot and will apply it to our next set of tests.
Comments from the individual that ran the statistical analysis:
A repeated measures analysis of variance (ANOVA) found no significant difference in sound ratings for the 5 different speaker types, F(4, 16) = 1.68, p = .205, partial eta-squared = .295.
Paired samples t-tests were then run to compare the average sound ratings between each possible pair of speakers. For the most part, speakers showed no significant differences in sound ratings, ps > .12. However, there was a significant difference between sound ratings for the JBL versus EdifierEQ speakers, t(4) = 3.88, p = .018, such that participants reported significantly better sound ratings for the JBL speaker (M = 6.18, SE = 0.31) over the EdifierEQ speaker (M = 5.64, SE = 0.40).
An interesting observation: for one group of listeners, we had to level match the speakers again and in our haste, we used pink noise instead of the actual material. This excites all frequencies equally which isn’t necessarily representative of the musical selections. The Neumann KH80 was a full 3db lower (ITU R 1770) when using the music tracks than most of the other speakers (we measured after the test and we clearly could hear differences in the volume of each speaker.) We threw out this data for our analysis, but the speaker with the lowest level was universally given awful ratings by each listener.
We are looking to conduct another test with a larger group, possibly this spring.
EDIT:
REW In-Room Measurements
Attached raw data of listener preference for anyone that wants to look at it.
Speakers (calculated preference score in parentheses):
- Neuman KH80 (6.2)
- JBL 305P Mark II (5.2)
- RCF Arya Pro5 (3.9)
- Edifier R1280T (2.1)
- Edifier R1280T w/ EQ (4.7)
Test Tracks:
- Fast Car – Tracy Chapman
- Bird on a Wire – Jennifer Warnes
- I Can See Clearly Now – Holly Cole
- Hunter – Björk
- Die Parade der Zinnsoldaten – Leon Jessel (Dallas Wind Symphony)
Unless noted below, we used the same equipment, controls, and procedures as last time, review that post for details.
- Motorized turntable: 1.75s switch time between any two speakers
- ITU R 1770 loudness instead of C weighting
- Significantly larger listening room
- 5 powered bookshelf/monitors (preference ratings from 2.1 to 6.2)
- Room measurements of each speaker at multiple listening position
We only had time to conduct the listening test with a small number of people and ended up having to toss out data on three individuals. The test was underpowered. We did not achieve statistical significance (p-value < .05). That said, here are the results we collected:
Spinorama of speakers:
In-room response plotted against estimated:
Our biggest takeaways were:
- Recruit a larger cohort
- Schedule on a weekend
- Well controlled experiments are hard
Once you get into well-behaving studio monitors, it becomes extremely difficult to tease apart the differences. It takes a lot of listening and tracks that excite small issues in each speaker. A preference score of 4 vs 6 appears to be a significant difference but depending on the nature of the flaws it can be extremely challenging to hear the difference. It is easy to hear that the speakers sound different but picking out the better speaker gets very difficult.
Running a well-controlled experiment is extremely difficult. We had to measure groups on different days and getting the level matching and all the bugs worked out was a challenge. We learned a lot and will apply it to our next set of tests.
Comments from the individual that ran the statistical analysis:
A repeated measures analysis of variance (ANOVA) found no significant difference in sound ratings for the 5 different speaker types, F(4, 16) = 1.68, p = .205, partial eta-squared = .295.
Paired samples t-tests were then run to compare the average sound ratings between each possible pair of speakers. For the most part, speakers showed no significant differences in sound ratings, ps > .12. However, there was a significant difference between sound ratings for the JBL versus EdifierEQ speakers, t(4) = 3.88, p = .018, such that participants reported significantly better sound ratings for the JBL speaker (M = 6.18, SE = 0.31) over the EdifierEQ speaker (M = 5.64, SE = 0.40).
An interesting observation: for one group of listeners, we had to level match the speakers again and in our haste, we used pink noise instead of the actual material. This excites all frequencies equally which isn’t necessarily representative of the musical selections. The Neumann KH80 was a full 3db lower (ITU R 1770) when using the music tracks than most of the other speakers (we measured after the test and we clearly could hear differences in the volume of each speaker.) We threw out this data for our analysis, but the speaker with the lowest level was universally given awful ratings by each listener.
We are looking to conduct another test with a larger group, possibly this spring.
EDIT:
REW In-Room Measurements
Attached raw data of listener preference for anyone that wants to look at it.
Attachments
Last edited: