Four Speaker Blind Listening Test Results (Kef, JBL, Revel, OSD)

TheHighContemplator · Aug 16, 2021

Excellent work, @amirm .

preload · Aug 16, 2021

Based on responses thus far, I am 95% confident that 15% of ASR members understand basic statistics and 90% do not.

PeteL · Aug 16, 2021

ROOSKIE said:
A

Yes, I completely understand the fun of this test. Should it make the front page? If you read your comment and really understand what "bias" is you will understand why without a control you have a lot of opportunity to create an exercise in expectation and confirmation bias. Entire multimillion dollar tests are often brought into question due to this sort of leaning toward a result.
The fact that the speakers corelate with the Harman score means maybe nothing without an independent variable/control and some proof that the listeners actually can pass a placebo test. Interestingly in some ways the speaker do not correlate with the Harman score depending on how you look at it. (such as the OSD speaker)
Anyway, not trying to poo-poo fun making. I do fun testing all the time, however all my comparison testing is rooted in subjective decisions due to my testing limitations and I am suggesting here that this test is more subjective than objective. They are 2 ends of the same stick and the best tests still must have subjective aspects to actually exist so having subjectivity influence the test is normal. I do think this test has quite a bit more subjective weight than idealized and it would be possible to address much of that. "tricking" the listener is one of the most common methods used. I mean literally every single pharmaceutical test uses a placebo. Maybe the word trick, triggers you. That trick is simply really important. You could use a reference instead/additional - such a a fulcrum speaker (which why the ABX is good model, it gets harder with more than two speakers though - think 3 body problem)

Sure, there could probably add this test first, to disqualify those who can't differentiate them, no problem with that, as long as it isn't used to conclude that speakers are undistinguishable, which is the shortcut that some would be tempted to do, since it's a flawed test anyway and not giving us data that conclude anything. It would just conclude that bias exists. I already know that. Now it wouldn't be a "simpler" test, at all, because at least the speakers would need to be positioned at the same position if the goal is to have listeners mistake one for the other. And if I'm correct, in ABX, the speed at which you switch matter no? If you have to redo the setup each time, you don't remember how the other one sounded. So I'm not even sure Bias could be concluded, you can't say they can't differentiate them if you've listen the other one 2 minutes ago. So a more complicated test, that don't give us any more conclusion.

PeteL · Aug 16, 2021

Semla said:
Statistical analysis of this experiment in one easily digestible graph:

You can compare two speakers by checking for overlap between the red arrows.

If the arrows overlap: no significant difference between two speakers.

If they don't overlap: statistically significant difference between two speakers.

Technical details: the dots show the estimated marginal means obtained from the model I detailed earlier in this thread, score by speaker, adjusted for song (fixed effect) and listener (random effect). Arrows show Bonferroni-adjusted intervals.

Close but not really. That would mean no significant data to confirm the superiority of one over the other, not mean anything about the difference, only ABX can do that.

MCH · Aug 16, 2021

Semla said:
Statistical analysis of this experiment in one easily digestible graph:
View attachment 147840
You can compare two speakers by checking for overlap between the red arrows.

If the arrows overlap: no significant difference between two speakers.

If they don't overlap: statistically significant difference between two speakers.

Technical details: the dots show the estimated marginal means obtained from the model I detailed earlier in this thread, score by speaker, adjusted for song (fixed effect) and listener (random effect). Arrows show Bonferroni-adjusted intervals.

Awesome! I would just suggest to rephrase as "no significant difference between the rating given by the listeners" as if I understand it well, the listeners were asked to rate the sound, not if they could hear a difference.
I know the first implies the second, but the second does need to imply the first (=two speakers can sound very different but the listener likes both the same and gives the same points).
Just a minimal semantics detail, but the difference is significant

Semla · Aug 16, 2021

PeteL said:
Close but not really. That would mean no significant data to confirm the superiority of one over the other, not mean anything about the difference, only ABX can do that.

It actually does show that significantly different scores were assigned to certain speakers, after controlling for song and listener. I did not test for superiority, nor for preference.

Semla · Aug 16, 2021

MarcosCh said:
Awesome! I would just suggest to rephrase as "no significant difference between the rating given by the listeners" as if I understand it well, the listeners were asked to rate the sound, not if they could hear a difference.
I know the first implies the second, but the second does need to imply the first (=two speakers can sound very different but the listener likes both the same and gives the same points).
Just a minimal semantics detail, but the difference is significant

good point - I've clarified the wording.

PeteL · Aug 16, 2021

Semla said:
It actually does show that significantly different scores were assigned to certain speakers, after controlling for song and listener. I did not test for superiority, nor for preference.

Not sure what you mean, ok significantly "different scores" But subjective superiority , or preference is what is being scored!

chych7 · Aug 16, 2021

Sounds like a follow-up test with a subwoofer would add more insight, since bass was a significant decider of preference. I wonder how strongly those speaker preference scores will correlate to actual preference, and relate to $$ per speaker...

Semla · Aug 16, 2021

PeteL said:
Not sure what you mean, ok significantly "different scores" But subjective superiority , or preference is what is being scored!

As an example, take the estimated marginal mean score for OSD and compare that with the estimated marginal mean score for the Revel. Conditional on this model an "average" listener (i.e. someone similar to the listeners in this panel) would assign the Revel a score that is about 1.4 points higher than the score they assigned to the OSD, if the same song was being played. That difference is statistically significant, even after correcting for multiple testing.

PeteL · Aug 16, 2021

Semla said:
As an example, take the estimated marginal mean score for OSD and compare that with the estimated marginal mean score for the Revel. Conditional on this model an "average" listener (i.e. someone similar to the listeners in this panel) would assign the Revel a score that is about 1.4 points higher than the score they assigned to the OSD, if the same song was being played. That difference is statistically significant, even after correcting for multiple testing.

I did not suggest otherwise, I was debating the term "difference in speakers" that was ambiguous.
edit: looks like you corrected it.

Josq · Aug 16, 2021

Looks like we have some skilled statisticians here. Would be nice to have some intuitive measures. Based on these results, given the same setup, given that we have heard none of them before, etc.

For example, for each of the speakers, what's the chance that I would rate the speaker as the best of the 4? In other words, what % of people would choose the KEF, Revel, JBL, OSD respectively, based on their own listening only?

PeteL · Aug 16, 2021

Josq said:
Looks like we have some skilled statisticians here. Would be nice to have some intuitive measures. Based on these results, given the same setup, given that we have heard none of them before, etc.

For example, for each of the speakers, what's the chance that I would rate the speaker as the best of the 4? In other words, what % of people would choose the KEF, Revel, JBL, OSD respectively, based on their own listening only?

I'll let the statistician talk too, and they can correct me if I'm wrong, I'm an engineer that did just bit of that, but I can already tell you you can't have these numbers, the "population" is not significant.

ROOSKIE · Aug 16, 2021

Spocko said:
You forget the overriding "subjectivity test" because at the end of the day, after weeks of research and listening tests, we end up with a speaker short list we are so proud of only to hear our significant other simply say "they're all ugly, can you find something prettier?" Best to start with a list of pretty speakers, let the real decision maker select the top 3, then among those we apply our research and listening skills.

Not my world.
My GF loves all the gear and she is pretty bad ass so deff often prefers industrial looking gear and black.

ROOSKIE · Aug 16, 2021

PeteL said:
Sure, there could probably add this test first, to disqualify those who can't differentiate them, no problem with that, as long as it isn't used to conclude that speakers are undistinguishable, which is the shortcut that some would be tempted to do, since it's a flawed test anyway and not giving us data that conclude anything. It would just conclude that bias exists. I already know that. Now it wouldn't be a "simpler" test, at all, because at least the speakers would need to be positioned at the same position if the goal is to have listeners mistake one for the other. And if I'm correct, in ABX, the speed at which you switch matter no? If you have to redo the setup each time, you don't remember how the other one sounded. So I'm not even sure Bias could be concluded, you can't say they can't differentiate them if you've listen the other one 2 minutes ago. So a more complicated test, that don't give us any more conclusion.

ABX is where X is also either A or B. You are looking to see if the listeners can accurately chose if X matches A or B. Speed is a factor but really not an issue if slow, in fact you can allow the listener to have control over the speed and the volume in the test. ABX is not AB.
Don't get me wrong it(ABX) is not a great way to test a large sample of speakers in a short time.
Really doing the multiple speaker test with a control/"trick" would be the way to test a 4 or 5 speaker set while having a pretty accurate data set for ranking them.

beefkabob · Aug 16, 2021

Now do it all again, but this time with a 24-db cutoff at 100hz.

MatthewS · Aug 16, 2021

GaryH said:
Did the listeners know what the selection of speakers under test was beforehand?

2 of the 12 listeners did. The two of us that organized it knew the speakers going in--but we were careful to randomize it so neither of us knew what was what. I connected them all up and did the virtual routing and then covered them behind the blind. My partner then used a random number generator to randomize everything. Inside the software they were just labeled A, B, C, D--but he didn't know what was what. "Speaker 1" for Fast Car and "Speaker 1" for Just a Little Lovin were randomized, they may have been the same and they may have been different.

MatthewS · Aug 16, 2021

Semla said:
Statistical analysis of this experiment in one easily digestible graph:

@Semla, This is great--thank you for putting the time in to help analyze the results. Do you mind if I add this graph to the original post with credit given?

Stokdoof · Aug 16, 2021

Is this test done because blind people on average have better hearing ? I cannot find anything about that in the description.

Semla · Aug 16, 2021

MatthewS said:
@Semla, This is great--thank you for putting the time in to help analyze the results. Do you mind if I add this graph to the original post with credit given?

Thanks! Feel free to use the results.

Four Speaker Blind Listening Test Results (Kef, JBL, Revel, OSD)

Active Member

Major Contributor

Major Contributor

Major Contributor

Major Contributor

Active Member

Active Member

Major Contributor

Active Member

Active Member

Major Contributor

Member

Major Contributor

Major Contributor

Major Contributor

Major Contributor

Member

Member

Active Member

Active Member

Similar threads