
Four Speaker Blind Listening Test Results (Kef, JBL, Revel, OSD)

richard12511

Major Contributor
Forum Donor
Joined
Jan 23, 2020
Messages
4,335
Likes
6,700
@PeteL As he states, the preference score is not completely predictable. That 76% (I am speaking from memory and apologize if the numbers are not correct) could explain that; the balance is what we "can't explain".

From what I understand, there were two published tests. The larger test just pitted a bunch of speakers against each other running full range; that test resulted in a 0.86 correlation. It was found that a big part of the variance came from differing amounts of bass. Another, smaller test was run that tried to equalize bass to eliminate that variable. The second study showed a much stronger correlation (0.99) between predicted and actual scores.
 
Last edited:

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,368
Likes
234,389
Location
Seattle Area
A preference score is reached by scoring various different attributes. I don't know if Toole used stepwise regression or whatever. But he states that certain characteristics contribute to the preference score and that the score is not "perfect". It gives a pretty good number.
The preference score was created by Dr. Sean Olive, not Dr. Toole. There is an entire paper on it that explains its construction. It is work independent of the research on desired speaker characteristics. Don't mix them up.
 

CtheArgie

Addicted to Fun and Learning
Forum Donor
Joined
Jan 11, 2020
Messages
509
Likes
773
Location
Agoura Hills, CA.
I don't understand why many have become so defensive on this issue. I said that I accept the work of Toole and Olive. I said that I want my speakers to follow the Harman curve of educated listeners. I said that I wished all studios used the same curve, to reduce this circle of confusion.
But! We have to also accept that going from a flat response in an anechoic room to a "Harman educated listeners curve" in a simulated living room using the "preferred" curve and spinorama, or using the Olive score, involves a preference. This is not an AP measurement. It is a "voting" system of what educated people liked more. There is nothing wrong with that! But this "measurement" is based on opinions or preference. We don't have a better way of determining speaker performance (yet). We have some other aspects to explain, which amirm used, for example, to explain his "preference" for the big Revels over the Genelecs (if I recall correctly). This is still science in need of a lot of progress. It is easier to measure a DAC or wires and determine their "quality". I don't think this is complicated or controversial! Can we move on now?
 

Randy Bessinger

Member
Forum Donor
Joined
Apr 26, 2017
Messages
84
Likes
159
^^Could you define "educated" as used in the context of your post? My understanding was that trained listeners vs. untrained listeners differed only in the amount of time it took to choose which speaker they preferred. Is that incorrect?
 

CtheArgie

Addicted to Fun and Learning
Forum Donor
Joined
Jan 11, 2020
Messages
509
Likes
773
Location
Agoura Hills, CA.
@Randy Bessinger, correct, my apologies: TRAINED listeners. There is research showing that untrained listeners had a different preferred response curve: a little more bass and more treble. As amirm has posted many times, many speaker manufacturers release speakers with what appears to be the "untrained listeners" preference curve because they sound more attractive in the salesroom.
 

Colonel7

Addicted to Fun and Learning
Joined
Feb 22, 2020
Messages
616
Likes
875
Location
Maryland, USA
@Randy Bessinger, correct, my apologies: TRAINED listeners. There is research showing that untrained listeners had a different preferred response curve: a little more bass and more treble. As amirm has posted many times, many speaker manufacturers release speakers with what appears to be the "untrained listeners" preference curve because they sound more attractive in the salesroom.
I don't think this is correct, but I don't have the research in front of me right now to verify. Have you read the research itself? It seems you've been going by memory of forum posts or hearsay, and you keep misstating things. I'll leave it to others to follow up.
 

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,554
Likes
1,701
Location
California
From what I understand, there were two published tests. The larger test just pitted a bunch of speakers against each other running full range; that test resulted in a 0.86 correlation. It was found that a big part of the variance came from differing amounts of bass. Another, smaller test was run that tried to equalize bass to eliminate that variable. The second study showed a much stronger correlation (0.99) between predicted and actual scores.

You keep citing that 0.99. What exactly correlated with an r of 0.99? Sorry, it doesn't sound familiar to me, and when things get repeated, people start believing them.
 

richard12511

Major Contributor
Forum Donor
Joined
Jan 23, 2020
Messages
4,335
Likes
6,700
You keep citing that 0.99. What exactly correlated with an r of 0.99? Sorry, it doesn't sound familiar to me, and when things get repeated, people start believing them.

There were two tests, from what I understand. The first test was with 70 loudspeakers with differing amounts of bass extension. The second test tried to control for bass extension by using bookshelves with similar extension. That second test only had 13(?) speakers, but it had an r of 0.99.

I did my best to google for it (I know it's somewhere in one of those 100-page AVS threads), but couldn't find anything. @Floyd Toole and @Sean Olive are on here, so maybe they can help (or correct me if I'm wrong).
 

Sancus

Major Contributor
Forum Donor
Joined
Nov 30, 2018
Messages
2,923
Likes
7,616
Location
Canada
The premise was that a speaker with a flat frequency response in an anechoic chamber and with certain dispersion characteristics when placed in a simulated living room would be preferred by a set of "educated" listeners. A preference score is reached by scoring various different attributes. I don't know if Toole used stepwise regression or whatever. But he states that certain characteristics contribute to the preference score and that the score is not "perfect". It gives a pretty good number.

You're skipping a lot of steps and different studies here. Have you actually read his book? It contains over 300 citations. The preference score is derived from this study (p1, p2) by Sean Olive; it builds upon Toole's research but is otherwise not part of it. That study is NOT the source of Toole's key insights.

Also, I assume you're talking about this graph with reference to preference. There is some variation in bass and treble preference, yes, but it is relatively minor when you consider the other key insights of the research: the requirement for even, smooth dispersion and freedom from resonances. Nobody is throwing speakers out because they have a bit too much bass, or whatever.

Honestly, the varying preferences for tonality are less a matter for speaker design and more one of many strong indicators that a complete audio system must have EQ/tone control. And if you presume that, then it only makes sense to use a curve somewhere in the middle of known preferences so that people can tune to taste with as little deviation as possible.
 

CtheArgie

Addicted to Fun and Learning
Forum Donor
Joined
Jan 11, 2020
Messages
509
Likes
773
Location
Agoura Hills, CA.
@Sancus, as I said (and I didn't buy the papers), it WAS based on a regression model. This is how preference scores are reached. Basically, you are expanding on what I summarized, nothing more.
 

richard12511

Major Contributor
Forum Donor
Joined
Jan 23, 2020
Messages
4,335
Likes
6,700
@preload, I was hoping someone would come forward with the source of the 13-bookshelf-speaker blind study with the 0.99 correlation, but no one has. I've only very recently started keeping track of Toole/Olive data in an organized manner, but I didn't do that while reading his book or the huge 100+ page threads on AVS. I'm 80% sure there are references to the study in the "How to choose a loudspeaker, what the science shows" thread, but it may take a while to reread that.
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,494
Likes
1,971
Location
La Garriga, Barcelona
To add a very simple approach to what has been said already: if we consider that the song does not matter, as all listeners listen to all songs the same number of times,

[Image: songs.png]


and we apply very simple statistics, namely the CLT (central limit theorem), to find a 95% confidence interval, we get the following:

[Image: ci.png]


[Image: gaussians.png]


The t-test between the speakers of the two groups leads to the conclusion that the null hypothesis (that the averages are drawn from the same distribution) can be rejected with great confidence, except for the JBL and KEF pair, which has a p-value of about 0.05: good, but not enough for all confidence levels.

Code:
library(tidyverse)

data <- read_csv("blind_test_data.csv")

# Drop the listeners flagged with "**" (different scoring scale)
data <- data %>%
  filter(!str_detect(Listener, "\\*\\*")) %>%
  arrange(Listener, Speaker)

# Mean score per speaker with a 95% CI from the t distribution
avg_data <- data %>%
  group_by(Speaker) %>%
  summarise(
    N = n(),
    avg_score = mean(Score),
    se = sd(Score) / sqrt(N),
    ci = se * qt(0.975, N - 1)
  )

avg_data %>%
  ggplot(aes(x = reorder(Speaker, avg_score), y = avg_score, color = Speaker)) +
    geom_point() +
    geom_errorbar(aes(ymin = avg_score - ci, ymax = avg_score + ci), width = 0.2) +
    labs(x = "Speaker", y = "Average score")
 

Last edited:

Semla

Active Member
Joined
Feb 8, 2021
Messages
170
Likes
328
If we consider that the song does not matter, as all listeners listen to all songs the same number of times
Bit of nitpicking: you might want to check that; some songs occur more often than others in this dataset.

The t-test between the speakers of the two groups leads to the conclusion that the null hypothesis (that the averages are drawn from the same distribution) can be rejected with great confidence, except for the JBL and KEF pair, which has a p-value of about 0.05: good, but not enough for all confidence levels.
You need to adjust for multiple testing if you run multiple t-tests. It's better to use an ANOVA with post-hoc correction.

In addition, a key assumption for the t-test and ANOVA is that your samples are independent, which is not the case here: observations are grouped ("nested") within listeners. In other words, scores from the same listener are more closely related than scores from different listeners. Typically this is modelled using a hierarchical model. In R: lme4::lmer(score ~ speaker * song + (1 | listener)).
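For concreteness, here is a minimal sketch of that model in R. It assumes the data frame and column names (Score, Speaker, Song, Listener) from the code posted earlier in the thread; it is an illustration of the approach, not the actual analysis.

Code:
# Mixed-effects model: fixed effects for speaker, song and their
# interaction; a random intercept per listener to capture the nesting
library(lme4)

fit <- lme4::lmer(Score ~ Speaker * Song + (1 | Listener), data = data)
summary(fit)    # fixed-effect estimates
VarCorr(fit)    # between-listener variance component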

If you were to go for the simplified approach, make sure to use a paired t-test: compare the difference for each speaker pair by listener, then correct for multiple testing. The drawbacks are that you won't be able to formally test whether songs matter (the song × speaker interaction term), calculate the within-listener correlation, or make a formal comparison between listeners. A sketch of this approach follows below.
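A minimal sketch of that simplified paired approach, under the same column-name assumptions as above: first average each listener's scores per speaker, so every listener contributes one observation per speaker, then run the paired t-tests with a multiplicity correction.

Code:
library(tidyverse)

# One mean score per listener per speaker
paired <- data %>%
  group_by(Listener, Speaker) %>%
  summarise(Score = mean(Score), .groups = "drop") %>%
  arrange(Speaker, Listener)   # identical listener order within each speaker

# All pairwise paired t-tests, Holm-corrected for multiple testing
pairwise.t.test(paired$Score, paired$Speaker,
                paired = TRUE, p.adjust.method = "holm")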
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,494
Likes
1,971
Location
La Garriga, Barcelona
Bit of nitpicking: you might want to check that; some songs occur more often than others in this dataset.

I didn't check the number-of-songs issue, but you are right: only one listener was tested with Pink Noise. After excluding the listeners with the different scoring scale, I get the following:

Song                  n   avg_score
Fast Car             40        5.92
Hunter               40        5.80
Just a Little Lovin  40        5.92
Morph the Cat        40        5.72
Pink Noise            4        5.50
Tin Pan Alley        40        6.02

Listeners disliked pink noise!

Why did you say that, then?
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,494
Likes
1,971
Location
La Garriga, Barcelona
You need to adjust for multiple testing if you run multiple t-tests. It's better to use an ANOVA with post-hoc correction.

In addition, a key assumption for the t-test and ANOVA is that your samples are independent, which is not the case here: observations are grouped ("nested") within listeners. In other words, scores from the same listener are more closely related than scores from different listeners. Typically this is modelled using a hierarchical model. In R: lme4::lmer(score ~ speaker * song + (1 | listener)).

If you were to go for the simplified approach, make sure to use a paired t-test: compare the difference for each speaker pair by listener, then correct for multiple testing. The drawbacks are that you won't be able to formally test whether songs matter (the song × speaker interaction term), calculate the within-listener correlation, or make a formal comparison between listeners.

I'm not fitting a linear model; that is the simplification. Therefore I'm not doing multiple testing, just a t-test between the samples of interest, one by one.

Code:
# Welch two-sample t-test between the scores of two speakers
t.test(x = data$Score[data$Speaker == "JBL Control X"], y = data$Score[data$Speaker == "KEF Q100"])

I understand this does not take the cross-relations into account the way a linear model does, and is therefore in contrast to what you have suggested; but again, this is the simplification. If I take everything you suggest into account, I'll end up doing the same as you.

A t-test requires the data to be t-distributed. The independence issue you mention is needed for the CLT to apply, meaning that the sample means tend to a Gaussian. Are you saying that we can't apply the CLT to this data, ignoring the song and the listeners and just considering that we have 40 samples for each speaker? We can try a simulation (after removing the Pink Noise song):

Code:
# Resample 30 of the scores for each speaker 1000 times and look at
# the distribution of the resulting sample means
purrr::map_df(1:1000, ~ {
  data %>%
    group_by(Speaker) %>%
    sample_n(30) %>%
    summarise(avg_score = mean(Score))
}) %>%
  ggplot(aes(x = avg_score, fill = Speaker)) +
    geom_density(alpha = 0.2)

[Image: clt.png]


Ok, this doesn't prove anything, but at least it shows it's not such a crazy idea.

I'm not claiming that this method is better than the linear model you proposed, just adding another way that is simpler, that more people can understand, and that shows essentially the same results. I think there is nothing wrong with that as long as we are transparent about how it is performed. Readers will judge for themselves.
 

Semla

Active Member
Joined
Feb 8, 2021
Messages
170
Likes
328
I'm not fitting a linear model; that is the simplification. Therefore I'm not doing multiple testing, just a t-test between the samples of interest, one by one.

You actually perform multiple comparisons. For every t-test, you run a 5% risk of a false positive. Keep running enough of these, and the overall risk of finding at least one false positive becomes quite high.
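To put a rough number on it: with 4 speakers there are 6 pairwise tests, and even under the optimistic assumption that they are independent, the chance of at least one false positive is already about 26%.

Code:
# Family-wise error rate for 6 pairwise tests among 4 speakers,
# assuming (optimistically) independent tests at alpha = 0.05
m <- choose(4, 2)    # 6 comparisons
1 - (1 - 0.05)^m     # ~0.26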

A t-test requires the data to be t-distributed.
That is wrong: the test statistic "t" follows a t-distribution under the null hypothesis if you meet certain assumptions, not the data.

One of these assumptions is independence. These data are not independent; they are nested within listener, which decreases the effective sample size because the observations are correlated. Consider an analogy: a researcher samples blood from 1 patient 40 times, vs. blood from 40 patients, 1 time each. Both procedures produce 40 blood samples, but only the second produces 40 independent blood samples.
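As a rough illustration of how much that correlation can cost, the standard Kish design-effect formula gives the effective number of independent observations. The cluster size and correlation below are made-up numbers, not estimates from this dataset.

Code:
# Effective sample size under clustering: n_eff = n / (1 + (m - 1) * rho).
# All values here are hypothetical, purely to illustrate the point.
n   <- 40     # raw observations per speaker
m   <- 10     # observations contributed by each listener (hypothetical)
rho <- 0.3    # hypothetical within-listener correlation
n / (1 + (m - 1) * rho)    # ~10.8 effectively independent observations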

The other assumption to keep in mind is that you need to sample at random from normally distributed populations with similar variance.

I think there is nothing wrong with that as long as we are transparent about how it is performed. Readers will judge for themselves.
The p-values this method produces are too small because it underestimates the standard errors. That means that it is likely to produce more false positive results than you anticipated.

Here is some background reading:
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,494
Likes
1,971
Location
La Garriga, Barcelona
Thanks for your comments.

These data are not independent; they are nested within listener.

The fact that there is a segment doesn't mean that it's meaningful. What if they added the gender or the age of the listener? Would that make the measurements dependent on age and gender? Independence means that the next outcome of the random variable does not depend on the previous one. If we take only the speaker and the score, this is a legitimate set of samples.

The p-values this method produces are too small because it underestimates the standard errors. That means that it is likely to produce more false positive results than you anticipated.

I obtain the following p-values:

JBL Control X" vs Revel W553L : p-value ~ 0.003
Revel W553L vs OSD 650 : p-value ~ 0.002

Of course the results are a bit different than yours, because the assumptions are different.

The other assumption to keep in mind is that you need to sample at random from normally distributed populations with similar variance.

This is true. In fact, I'm applying the CLT because 40 is a big enough number, so I don't need the data to come from a normal distribution, and therefore I don't care about the t-distribution. The use of the t-test is just a convention, as Student's t with 40 observations is practically a normal.

Keep running enough of these, and the overall risk of finding at least one false positive becomes quite high.

I'm not following. Each test gives the same result every time, of course. What do you mean?

As a starting point, I have enough of a statistics background to skip the Wikipedia articles. I don't rule out that you can convince me of the opposite, though.
 

dualazmak

Major Contributor
Forum Donor
Joined
Feb 29, 2020
Messages
2,820
Likes
2,950
Location
Ichihara City, Chiba Prefecture, Japan
Just for reference: if we had some "gold standard(s)", we could also evaluate the reliability of the "test" by sensitivity/specificity analysis; in this kind of audio ABX test, however, the concept of sensitivity/specificity would not be directly applicable, right? (I have been in the field of medical imaging diagnosis R&D for many years.)
[Image: WS002422.JPG]
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,494
Likes
1,971
Location
La Garriga, Barcelona
Just for reference: if we had some "gold standard(s)", we could also evaluate the reliability of the "test" by sensitivity/specificity analysis; in this kind of audio ABX test, however, the concept of sensitivity/specificity would not be directly applicable, right? (I have been in the field of medical imaging diagnosis R&D for many years.)
[Image: WS002422.JPG]

It's interesting how statistics has different flavors depending on the field of application. I was not familiar with those quantities, although that doesn't mean much, as I don't have that much experience.

To clarify, this is not ABX testing we are talking about in this thread. Here we just have scores for 4 different speakers, with some segments like the song and the listener, and we can look at several things.

An ABX test consists of a Bernoulli experiment in which the outcome of each trial is A or B, corresponding to the subject stating whether the sound sample they are hearing corresponds to A or B. If the subject gives the right answer in 9 out of 10 trials, the probability of differentiating the two sound samples by chance is low enough that we can state they are able to tell them apart. We know in advance how many times the subject has to get it right, because the level of the test does not depend on the data but on the definition of the test.
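That arithmetic is easy to check in R: under pure guessing, the number of correct answers is Binomial(10, 0.5), and 9 or more correct has a probability of about 1%.

Code:
# Probability of getting 9 or more out of 10 ABX trials right by guessing
binom.test(9, 10, p = 0.5, alternative = "greater")   # p-value ~ 0.011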

The amount of false positives accepted is determined by the level of the test. But we have to be precise about what we are testing. The null hypothesis of those t-tests, whether they come from the simple approach or from a linear model, is that the sample means obtained come from the same distribution and the differences are due to chance. Getting a low p-value means that we can reject this null hypothesis, knowing that the probability of a false positive equals the p-value.

The approach of @Semla is more ambitious as it builds a model to predict future scores.

The discussion with @Semla is more about the assumptions that can be made or not than about the nature or reliability of the tests themselves. He/she proposes a linear model with dummy variables because it accounts for all the relations between the different factors, which is indeed very reasonable.

Please, correct me if I'm wrong.

Edit: I forgot to answer whether the coefficients you point to can be applied here.

The answer is no, because they are based on real observations after the experiment, measuring all four possible outcomes and calculating some ratios. But that doesn't make sense in this context, as there aren't outcomes, since there is no treatment effect. It's a different setting.
 
Last edited: