
Four Speaker Blind Listening Test Results (Kef, JBL, Revel, OSD)

PeteL

Major Contributor
Joined
Jun 1, 2020
Messages
3,303
Likes
3,846

Yes, I completely understand the fun of this test. Should it make the front page? If you read your comment and really understand what "bias" is, you will understand why, without a control, you have a lot of opportunity to create an exercise in expectation and confirmation bias. Entire multimillion-dollar tests are often called into question because of this sort of leaning toward a result.
The fact that the speakers correlate with the Harman score may mean nothing without an independent variable/control and some proof that the listeners can actually pass a placebo test. Interestingly, in some ways the speakers do not correlate with the Harman score, depending on how you look at it (the OSD speaker, for example).
Anyway, I'm not trying to poo-poo the fun. I do fun testing all the time; however, all my comparison testing is rooted in subjective decisions because of my testing limitations, and I am suggesting here that this test is more subjective than objective. They are two ends of the same stick, and the best tests still must have subjective aspects to actually exist, so having subjectivity influence the test is normal. I do think this test carries quite a bit more subjective weight than is ideal, and it would be possible to address much of that. "Tricking" the listener is one of the most common methods used; literally every single pharmaceutical trial uses a placebo. Maybe the word "trick" triggers you, but that trick is really important. You could use a reference instead or in addition, such as a fulcrum speaker (which is why ABX is a good model; it gets harder with more than two speakers though, think three-body problem).
Sure, we could probably add this test first, to disqualify those who can't differentiate the speakers; no problem with that, as long as it isn't used to conclude that the speakers are indistinguishable, which is the shortcut some would be tempted to take, since it's a flawed test anyway and doesn't give us data that concludes anything. It would just conclude that bias exists, and I already know that. Now, it wouldn't be a "simpler" test at all, because at a minimum the speakers would need to be placed in the same position if the goal is to have listeners mistake one for the other. And if I'm correct, in ABX the speed at which you switch matters, no? If you have to redo the setup each time, you don't remember how the other one sounded, so I'm not even sure bias could be concluded; you can't say listeners can't differentiate two speakers if they last heard the other one two minutes ago. So it's a more complicated test that doesn't give us any more of a conclusion.
 

PeteL

Major Contributor
Joined
Jun 1, 2020
Messages
3,303
Likes
3,846
Statistical analysis of this experiment in one easily digestible graph:

You can compare two speakers by checking for overlap between the red arrows.
  • If the arrows overlap: no significant difference between two speakers.
  • If they don't overlap: statistically significant difference between two speakers.
Technical details: the dots show the estimated marginal means obtained from the model I detailed earlier in this thread, score by speaker, adjusted for song (fixed effect) and listener (random effect). Arrows show Bonferroni-adjusted intervals.
Close, but not really. That would mean there is no significant data to confirm the superiority of one over the other; it says nothing about the difference between the speakers themselves. Only ABX can do that.
 

MCH

Major Contributor
Joined
Apr 10, 2021
Messages
2,641
Likes
2,251
Statistical analysis of this experiment in one easily digestible graph:
You can compare two speakers by checking for overlap between the red arrows.
  • If the arrows overlap: no significant difference between two speakers.
  • If they don't overlap: statistically significant difference between two speakers.
Technical details: the dots show the estimated marginal means obtained from the model I detailed earlier in this thread, score by speaker, adjusted for song (fixed effect) and listener (random effect). Arrows show Bonferroni-adjusted intervals.
Awesome! I would just suggest rephrasing it as "no significant difference between the ratings given by the listeners," since, if I understand it correctly, the listeners were asked to rate the sound, not whether they could hear a difference.
I know the first implies the second, but the second does not need to imply the first (i.e., two speakers can sound very different, yet the listener likes both the same and gives them the same points).
Just a minor semantic detail, but the difference is significant ;)
 

Semla

Active Member
Joined
Feb 8, 2021
Messages
170
Likes
328
Close, but not really. That would mean there is no significant data to confirm the superiority of one over the other; it says nothing about the difference between the speakers themselves. Only ABX can do that.
It actually does show that significantly different scores were assigned to certain speakers, after controlling for song and listener. I did not test for superiority, nor for preference.
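For anyone who wants to poke at the numbers themselves, here is a minimal sketch in Python of fitting that kind of mixed model. The file name and column names (score, speaker, song, listener) are placeholders for however the raw ratings are stored; this is illustrative, not necessarily the exact code behind the graph.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format table: one row per (listener, song, speaker) rating
df = pd.read_csv("scores.csv")  # columns: score, speaker, song, listener

# Mixed model: speaker and song as fixed effects, listener as a random intercept
model = smf.mixedlm("score ~ C(speaker) + C(song)", df, groups=df["listener"])
fit = model.fit()
print(fit.summary())

# With speaker coded as a categorical fixed effect, each speaker coefficient is a
# difference from the reference speaker; pairwise speaker contrasts (the estimated
# marginal means comparison shown in the graph) follow from those coefficients.
```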
 

Semla

Active Member
Joined
Feb 8, 2021
Messages
170
Likes
328
Awesome! I would just suggest rephrasing it as "no significant difference between the ratings given by the listeners," since, if I understand it correctly, the listeners were asked to rate the sound, not whether they could hear a difference.
I know the first implies the second, but the second does not need to imply the first (i.e., two speakers can sound very different, yet the listener likes both the same and gives them the same points).
Just a minor semantic detail, but the difference is significant ;)
Good point; I've clarified the wording.
 

PeteL

Major Contributor
Joined
Jun 1, 2020
Messages
3,303
Likes
3,846
It actually does show that significantly different scores were assigned to certain speakers, after controlling for song and listener. I did not test for superiority, nor for preference.
Not sure what you mean. OK, significantly "different scores," but subjective superiority, or preference, is what is being scored!
 

chych7

Active Member
Joined
Aug 28, 2020
Messages
276
Likes
422
Sounds like a follow-up test with a subwoofer would add more insight, since bass was a significant decider of preference. I wonder how strongly those speaker preference scores will correlate to actual preference, and relate to $$ per speaker...
 

Semla

Active Member
Joined
Feb 8, 2021
Messages
170
Likes
328
Not sure what you mean. OK, significantly "different scores," but subjective superiority, or preference, is what is being scored!
As an example, take the estimated marginal mean score for OSD and compare that with the estimated marginal mean score for the Revel. Conditional on this model an "average" listener (i.e. someone similar to the listeners in this panel) would assign the Revel a score that is about 1.4 points higher than the score they assigned to the OSD, if the same song was being played. That difference is statistically significant, even after correcting for multiple testing.
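To make the multiple-testing part concrete: with four speakers there are six pairwise comparisons, and a Bonferroni correction simply tests each pair at a stricter per-comparison level (equivalently, it widens each interval). A rough sketch; the standard error and degrees of freedom below are made-up placeholders, not the actual model output:

```python
from math import comb
from scipy import stats

speakers = ["KEF", "Revel", "JBL", "OSD"]
n_pairs = comb(len(speakers), 2)          # 6 pairwise comparisons among 4 speakers
alpha_family = 0.05
alpha_per_pair = alpha_family / n_pairs   # Bonferroni: test each pair at ~0.0083

# Illustrative numbers only: a pairwise difference in estimated marginal means,
# its standard error, and degrees of freedom, checked against the adjusted level.
diff, se, dof = 1.4, 0.4, 200
t_stat = diff / se
p_raw = 2 * stats.t.sf(abs(t_stat), dof)  # unadjusted two-sided p-value
print(n_pairs, round(alpha_per_pair, 4), round(p_raw, 5), p_raw < alpha_per_pair)
```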
 

PeteL

Major Contributor
Joined
Jun 1, 2020
Messages
3,303
Likes
3,846
As an example, take the estimated marginal mean score for OSD and compare that with the estimated marginal mean score for the Revel. Conditional on this model an "average" listener (i.e. someone similar to the listeners in this panel) would assign the Revel a score that is about 1.4 points higher than the score they assigned to the OSD, if the same song was being played. That difference is statistically significant, even after correcting for multiple testing.
I did not suggest otherwise; I was debating the term "difference in speakers," which was ambiguous.
Edit: looks like you corrected it.
 

Josq

Member
Joined
Aug 11, 2020
Messages
69
Likes
79
Looks like we have some skilled statisticians here. It would be nice to have some intuitive measures based on these results: given the same setup, given that we have heard none of them before, etc.

For example, for each of the speakers, what's the chance that I would rate the speaker as the best of the 4? In other words, what % of people would choose the KEF, Revel, JBL, OSD respectively, based on their own listening only?
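To be clear about what I mean, here is the sample version of that number for this panel, sketched with hypothetical column names; turning it into a statement about people in general is of course a separate question.

```python
import pandas as pd

# Hypothetical long-format table of the raw ratings:
# one row per (listener, song, speaker) with columns listener, song, speaker, score
df = pd.read_csv("scores.csv")

# For each listener and song, find the speaker they scored highest (ties go to the
# first row), then count how often each speaker "wins" across the panel.
idx = df.groupby(["listener", "song"])["score"].idxmax()
wins = df.loc[idx, "speaker"].value_counts(normalize=True)
print(wins)  # fraction of listener/song trials in which each speaker was rated best
```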
 

PeteL

Major Contributor
Joined
Jun 1, 2020
Messages
3,303
Likes
3,846
Looks like we have some skilled statisticians here. It would be nice to have some intuitive measures based on these results: given the same setup, given that we have heard none of them before, etc.

For example, for each of the speakers, what's the chance that I would rate the speaker as the best of the 4? In other words, what % of people would choose the KEF, Revel, JBL, OSD respectively, based on their own listening only?
I'll let the statisticians talk, and they can correct me if I'm wrong; I'm an engineer who has done just a bit of this. But I can already tell you that you can't get those numbers: the "population" is too small to be significant.
 

ROOSKIE

Major Contributor
Joined
Feb 27, 2020
Messages
1,934
Likes
3,517
Location
Minneapolis
You forget the overriding "subjectivity test" because at the end of the day, after weeks of research and listening tests, we end up with a speaker short list we are so proud of only to hear our significant other simply say "they're all ugly, can you find something prettier?" Best to start with a list of pretty speakers, let the real decision maker select the top 3, then among those we apply our research and listening skills.
Not my world.
My GF loves all the gear, and she is pretty badass, so she definitely often prefers industrial-looking gear and black.
 

ROOSKIE

Major Contributor
Joined
Feb 27, 2020
Messages
1,934
Likes
3,517
Location
Minneapolis
Sure, we could probably add this test first, to disqualify those who can't differentiate the speakers; no problem with that, as long as it isn't used to conclude that the speakers are indistinguishable, which is the shortcut some would be tempted to take, since it's a flawed test anyway and doesn't give us data that concludes anything. It would just conclude that bias exists, and I already know that. Now, it wouldn't be a "simpler" test at all, because at a minimum the speakers would need to be placed in the same position if the goal is to have listeners mistake one for the other. And if I'm correct, in ABX the speed at which you switch matters, no? If you have to redo the setup each time, you don't remember how the other one sounded, so I'm not even sure bias could be concluded; you can't say listeners can't differentiate two speakers if they last heard the other one two minutes ago. So it's a more complicated test that doesn't give us any more of a conclusion.
ABX is where X is also either A or B. You are looking to see whether the listeners can accurately choose whether X matches A or B. Speed is a factor, but it's really not an issue if switching is slow; in fact, you can give the listener control over the speed and the volume in the test. ABX is not AB.
Don't get me wrong: ABX is not a great way to test a large sample of speakers in a short time.
Really, doing the multiple-speaker test with a control/"trick" would be the way to test a four- or five-speaker set while getting a pretty accurate data set for ranking them.
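For reference, scoring an ABX run is straightforward: count how many trials the listener matched X correctly and compare against guessing with an exact binomial test. A minimal sketch (the example counts are arbitrary):

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """Exact one-sided binomial p-value: the probability of matching X correctly
    at least `correct` times out of `trials` purely by guessing (chance = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Example: 12 correct out of 16 trials
print(abx_p_value(12, 16))  # ~0.038, below the usual 0.05 threshold
```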
 

beefkabob

Major Contributor
Forum Donor
Joined
Apr 18, 2019
Messages
1,652
Likes
2,093
Now do it all again, but this time with a 24 dB cutoff at 100 Hz.
 
OP

MatthewS

Member
Forum Donor
Joined
Jul 31, 2020
Messages
95
Likes
862
Location
Greater Seattle
Did the listeners know what the selection of speakers under test was beforehand?

Two of the 12 listeners did. The two of us who organized it knew the speakers going in, but we were careful to randomize things so that neither of us knew which was which. I connected them all up, did the virtual routing, and then covered them behind the blind. My partner then used a random number generator to randomize everything. Inside the software they were just labeled A, B, C, and D, but he didn't know which was which. "Speaker 1" for Fast Car and "Speaker 1" for Just a Little Lovin were randomized independently; they may have been the same and they may have been different.
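For anyone curious what that per-song randomization looks like, here is a minimal sketch (not the actual script we used; the labels and track names are just from this test):

```python
import random

speakers = ["KEF", "JBL", "Revel", "OSD"]
songs = ["Fast Car", "Just a Little Lovin"]

# Independently shuffle the speaker-to-label assignment for each song, so the
# unit behind label "A" on one track need not be the same unit on another.
assignments = {}
for song in songs:
    order = speakers[:]
    random.shuffle(order)
    assignments[song] = dict(zip(["A", "B", "C", "D"], order))

for song, mapping in assignments.items():
    print(song, mapping)
```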
 