Four Speaker Blind Listening Test Results (Kef, JBL, Revel, OSD)

richard12511 · Aug 19, 2021

CtheArgie said:
@PeteL As he states, the preference score is not completely predictable. That 76% (I am speaking from memory and apologize if the numbers are not correct) could mean that the balance is what we "can't explain".

From what I understand, there were two published tests. The larger test just pitted a bunch of speakers against each other running full range. That test resulted in a .86 correlation. It was found that a big part of that variance came from differing amounts of bass. Another, smaller test was run that tried to equalize bass to eliminate that variable. The second study showed a much stronger correlation (.99) between predicted and actual scores.

amirm · Aug 19, 2021

CtheArgie said:
A preference score is reached by scoring various different attributes. I don't know if Toole used step wise regression or whatever. But he states that certain characteristics contribute to the preference score and that the score is not "perfect". It gives a pretty good number.

The preference score was created by Dr. Sean Olive, not Dr. Toole. There is an entire paper on it that explains its construction. It is independent of the work on the desired characteristics of a speaker. Don't mix them up.

CtheArgie · Aug 19, 2021

I don't understand why many have become so defensive on this issue. I said that I accept the work of Toole and Olive. I said that I want my speakers to follow the Harman curve of educated listeners. I said that I wished all studios used the same curve, to reduce this circle of confusion.
But! We also have to accept that going from a flat response in an anechoic room to a "Harman educated listeners curve" in a simulated living room using the "preferred" curve and spinorama or using the Olive score involves a preference. This is not an AP measurement. It is a "voting" system of what educated people liked more. There is nothing wrong with that! But this "measurement" is based on opinions or preference. We don't have a better way of determining speaker performance (yet). We have some other aspects to explain, which amirm used, for example, to explain his "preference" for the big Revels over the Genelec (if I recall correctly). This is still science in need of a lot of progress. It is easier to measure a DAC or wires and determine its "quality". I don't think this is complicated or controversial! Can we move on now?

Randy Bessinger · Aug 19, 2021

^^Could you define educated as used in the context of your post? My understanding was that trained listeners vs. untrained listeners differed only in the amount of time it took to choose which speaker they preferred. Is that incorrect?

CtheArgie · Aug 19, 2021

@Randy Bessinger , correct, my apologies. TRAINED listeners. There is research that shows that untrained listeners had a different preferred response curve. A little more bass and more treble. As amirm has posted many times, many speaker manufacturers release speakers with what appears to be an "untrained listeners" preference curve because they appear more attractive in the salesroom.

Randy Bessinger · Aug 20, 2021

^^ok thanks.

Colonel7 · Aug 20, 2021

CtheArgie said:
@Randy Bessinger , correct, my apologies. TRAINED listeners. There is research that shows that untrained listeners had a different preferred response curve. A little more bass and more treble. As amirm has posted many times, many speaker manufacturers release speakers with what appears to be an "untrained listeners" preference curve because they appear more attractive in the salesroom.

I don't think this is correct but don't have the research in front of me right now to verify. Have you read the research itself? Seems you've been going by memory of forum posts or hearsay and you keep misstating things. I'll leave it to others to follow up

preload · Aug 20, 2021

richard12511 said:
From what I understand, there were two published tests. The larger test just pitted a bunch of speakers against each other running full range. That test resulted in a .86 correlation. It was found that a big part of that variance came from differing amounts of bass. Another, smaller test was run that tried to equalize bass to eliminate that variable. The second study showed a much stronger correlation (.99) between predicted and actual scores.

You keep citing that 0.99. What exactly correlated with an r of 0.99? Sorry it doesn't sound familiar to me and when things get repeated people start believing it.

richard12511 · Aug 21, 2021

preload said:
You keep citing that 0.99. What exactly correlated with an r of 0.99? Sorry it doesn't sound familiar to me and when things get repeated people start believing it.

There were two tests from what I understand. First test was with 70 loudspeakers with differing amounts of bass extension. The second test tried to control bass extension by using bookshelves with similar extension. That second test only had 13? speakers, though, but it had an r of 0.99.

I did my best to google for it (I know it's somewhere in one of those 100-page AVS threads), but couldn't find anything. @Floyd Toole and @Sean Olive are on here, so maybe they can help (or correct me if I'm wrong).

Sancus · Aug 21, 2021

CtheArgie said:
The premise was that a speaker with a flat frequency response in an anechoic chamber and with certain dispersion characteristics when placed in a simulated living room would be preferred by a set of "educated" listeners. A preference score is reached by scoring various different attributes. I don't know if Toole used step wise regression or whatever. But he states that certain characteristics contribute to the preference score and that the score is not "perfect". It gives a pretty good number.

You're skipping a lot of steps and different studies here. Have you actually read his book? It contains over 300 citations. The preference score is derived from this study(p1, p2) by Sean Olive and it builds upon Toole's research but other than that is not part of it. That study is NOT the source of Toole's key insights.

Also, I assume you're talking about this graph with reference to preference. There is some variation in bass and treble preference, yes, but this is relatively minor when you consider the other major key insights of the research: the requirement for even, smooth dispersion and freedom from resonances. Nobody is throwing speakers out because they have a bit too much bass, or whatever.

Honestly, the varying preferences for tonality are less a matter for speaker design and more one of many strong indicators that a complete audio system must have EQ/tone control. And if you presume that, then it only makes sense to use a curve somewhere in the middle of known preferences so that people can tune to taste with as little deviation as possible.

CtheArgie · Aug 21, 2021

@Sancus, as I said, and I didn’t buy the papers, it WAS based on a regression model. This is how preference scores are reached. Basically, you are expanding on what I summarized, nothing more.

richard12511 · Aug 24, 2021

@preload , was hoping someone would come forward with the source of the 13 bookshelf speaker blind study with .99 correlation factor, but no one has. I’ve only very recently started keeping track of Toole/Olive data in an organized manner, but I didn’t do that while reading his book or the huge 100+ pages on AVS. I’m 80% sure there are references to the study in the “How to choose a loudspeaker, what the science shows” thread, but it may take awhile to reread that.

xaviescacs · Aug 25, 2021

To add a very simple approach to what has been said already. If we consider that the song does not matter as all listeners listen to all songs the same number of times

and we apply very simple statistics, namely the CLT to find a 95% confidence interval, we get the following:

The t.test between the speakers of the two groups leads to the conclusion that the null hypothesis that the averages are drawn from the same distribution can be rejected with great confidence, except for the pair JBL and KEF, which has a p-value of about 0.05; that is good but not enough for all confidence levels.

Code:

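library(tidyverse)

# Load the blind test scores.
data <- read_csv("blind_test_data.csv")

# Exclude listeners marked with "**" and sort the rows.
data <- data %>%
  filter(!str_detect(Listener, "\\*\\*")) %>%
  arrange(Listener, Speaker)

# Per speaker: sample size, mean score, standard error,
# and the half-width of a 95% t-based confidence interval.
avg_data <- data %>%
  group_by(Speaker) %>%
  summarise(
    N = n(),
    avg_score = mean(Score),
    se = sd(Score) / sqrt(N),
    ci = se * qt(0.975, N - 1)
  )

# Plot the means with their confidence intervals, ordered by mean score.
avg_data %>%
  ggplot(aes(y = avg_score, x = reorder(Speaker, avg_score), color = Speaker)) +
    geom_point() +
    geom_errorbar(aes(ymin = avg_score - ci, ymax = avg_score + ci), width = 0.2) +
    labs(x = "Speaker", y = "Average score")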

Semla · Aug 26, 2021

xaviescacs said:
If we consider that the song does not matter as all listeners listen to all songs the same number of times

Bit of nitpicking: you might want to check that, some songs occur more than others in this dataset.

xaviescacs said:
The t.test between the speakers of the two groups leads to the conclusion that the null hypothesis that the averages are drawn from the same distribution can be rejected with great confidence, except for the pair JBL and KEF, which has a p-value of about 0.05; that is good but not enough for all confidence levels.

You need to adjust for multiple testing if you run multiple t-tests. It's better to use an ANOVA with post-hoc correction.

In addition, a key assumption for the t-test and ANOVA is that your samples are independent which is not the case here: observations are grouped ("nested" within listeners). In other words, scores from the same listener are more closely related than scores between listeners. Typically this is modelled using a hierarchical model. In R: lme4::lmer(score ~ speaker * song + (1 | listener)).

If you were to go for the simplified approach make sure to use a paired t-test - compare the difference for each speaker pair by listener, then correct for multiple testing. The drawbacks are that you won't be able to test formally that songs matter (song x speaker interaction term), calculate the within-listener correlation, or make a formal comparison between listeners.
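For anyone who wants to try the simplified route, here is a minimal sketch in base R of paired t-tests per speaker pair with a Holm correction. It runs on simulated data, not the thread's dataset: the listener IDs, speaker labels A to D, effect sizes and noise levels are all made up for illustration, and averaging each listener's scores per speaker before pairing is one possible choice, not necessarily the one anyone here used.

```r
# Simulated stand-in data (hypothetical): 10 listeners score 4 speakers.
set.seed(1)
listeners <- paste0("L", 1:10)
speakers  <- c("A", "B", "C", "D")
scores <- expand.grid(Listener = listeners, Speaker = speakers,
                      stringsAsFactors = FALSE)

# Each listener has a personal offset, which makes their scores correlated.
listener_offset <- setNames(rnorm(length(listeners), sd = 1), listeners)
speaker_effect  <- c(A = 6.5, B = 6.0, C = 5.5, D = 4.5)
scores$Score <- speaker_effect[scores$Speaker] +
  listener_offset[scores$Listener] +
  rnorm(nrow(scores), sd = 0.5)

# One mean score per listener (rows) and speaker (columns).
wide <- tapply(scores$Score, list(scores$Listener, scores$Speaker), mean)

# Paired t-test for every speaker pair, comparing within listener.
pairs <- combn(speakers, 2)
p_raw <- apply(pairs, 2, function(p) {
  t.test(wide[, p[1]], wide[, p[2]], paired = TRUE)$p.value
})
names(p_raw) <- apply(pairs, 2, paste, collapse = " vs ")

# Holm correction for the six comparisons.
p_holm <- p.adjust(p_raw, method = "holm")
print(round(cbind(raw = p_raw, holm = p_holm), 4))
```

With the real data, `wide` would be built from the Listener, Speaker and Score columns in the same way; the pairing only works if every listener scored every speaker.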

xaviescacs · Aug 26, 2021

Semla said:
Bit of nitpicking: you might want to check that, some songs occur more than others in this dataset.

I didn't check the number of songs issue. But you are right, only Listener one was tested with Pink Noise. After excluding the listeners with the different scoring scale I get the following:

  Song                    n avg_score
  <chr>               <int>     <dbl>
1 Fast Car               40      5.92
2 Hunter                 40      5.8
3 Just a Little Lovin    40      5.92
4 Morph the Cat          40      5.72
5 Pink Noise              4      5.5
6 Tin Pan Alley          40      6.02

Semla said:
Listeners disliked pink noise!

Why did you say this, then?

xaviescacs · Aug 26, 2021

Semla said:
You need to adjust for multiple testing if you run multiple t-tests. It's better to use an ANOVA with post-hoc correction.

In addition, a key assumption for the t-test and ANOVA is that your samples are independent which is not the case here: observations are grouped ("nested" within listeners). In other words, scores from the same listener are more closely related than scores between listeners. Typically this is modelled using a hierarchical model. In R: lme4::lmer(score ~ speaker * song + (1 | listener)).

If you were to go for the simplified approach make sure to use a paired t-test - compare the difference for each speaker pair by listener, then correct for multiple testing. The drawbacks are that you won't be able to test formally that songs matter (song x speaker interaction term), calculate the within-listener correlation, or make a formal comparison between listeners.

I'm not fitting a linear model; this is the simplification. Therefore I'm not doing multiple testing, just a t.test between the samples of interest, one by one.

Code:

t.test(x = data$Score[data$Speaker == "JBL Control X"], y = data$Score[data$Speaker == "KEF Q100"])

I understand this does not take the cross relations into account, as a linear model does, and is therefore in contrast to what you have suggested, but again, this is the simplification. If I take all you suggest into account I'll end up doing the same as you.

A t.test requires the data to be t distributed. The independence issue you mention is for the CLT to apply, meaning that you can then say that the sample mean tends to a Gaussian. Are you saying that we can't apply the CLT to this data, ignoring the song and the listeners and just considering we have 40 samples for each speaker? We can try a simulation (after removing the Pink Noise song):

Code:

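# Repeat 1000 times: draw 30 scores per speaker (without replacement)
# and compute the mean score of each draw.
purrr::map_df(1:1000, ~ {
  data %>%
    group_by(Speaker) %>%
    sample_n(30) %>%
    summarise(
      avg_score = mean(Score)
    )
}) %>%
  # Density of the resampled means, one curve per speaker.
  ggplot(aes(x = avg_score, fill = Speaker)) +
    geom_density(alpha = 0.2)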

Ok, this doesn't prove anything, but at least it shows it's not such a crazy idea.

I'm not claiming that this method is better than the linear model you proposed, just adding another way that is more simple and more people can understand yet shows essentially the same results. I think there is nothing wrong with that as long as we are transparent about how this is performed. Readers will judge for themselves.

Semla · Aug 26, 2021

xaviescacs said:
I'm not fitting a linear model; this is the simplification. Therefore I'm not doing multiple testing, just a t.test between the samples of interest, one by one.

You actually perform multiple comparisons. For every t-test you run a 5% risk of a false positive. Keep running enough of these and the overall risk of finding at least one false positive becomes quite high.
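To put a number on that, here is a small R sketch of the family-wise error rate for all pairwise comparisons among four speakers. It assumes the tests are independent, which is a simplification (pairwise tests that share a speaker are not), so treat the result as a rough guide only.

```r
alpha <- 0.05          # per-test false positive rate
k <- choose(4, 2)      # 6 pairwise comparisons among 4 speakers

# Probability of at least one false positive across k independent tests.
fwer <- 1 - (1 - alpha)^k

# Bonferroni-style per-test threshold that keeps the overall rate near alpha.
alpha_bonferroni <- alpha / k

print(c(comparisons = k, fwer = fwer, per_test_threshold = alpha_bonferroni))
```

Under these assumptions the chance of at least one false positive is about 0.26 rather than 0.05, and the Bonferroni threshold per test is about 0.0083.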

xaviescacs said:
A t.test requires the data to be t distributed.

That is wrong: the test statistic "t" follows a t-distribution under the null hypothesis if you meet certain assumptions, not the data.

One of these assumptions is independence. These data are not independent, they are nested within listener (which decreases the effective sample size because they are correlated). Consider an analogy: a researcher samples blood from 1 patient 40 times, vs. blood from 40 patients, 1 time each. Both procedures produce 40 blood samples, but only the second procedure produces 40 independent blood samples.
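A rough way to see the size of this effect is the standard design-effect approximation for clustered data, sketched below in R. The function name `effective_n` and the numbers plugged in (40 scores, 4 scores per listener, within-listener correlation of 0.5) are hypothetical and chosen only for illustration; they are not estimates from this dataset.

```r
# Effective sample size under the design-effect approximation:
#   n_eff = n / (1 + (m - 1) * icc)
# n   : total number of observations
# m   : observations per cluster (here: scores per listener)
# icc : intraclass (within-listener) correlation
effective_n <- function(n, m, icc) {
  n / (1 + (m - 1) * icc)
}

# Hypothetical example: 40 scores, 4 per listener, correlation 0.5.
n_eff <- effective_n(n = 40, m = 4, icc = 0.5)
print(n_eff)

# With no within-listener correlation, nothing is lost.
print(effective_n(n = 40, m = 4, icc = 0))
```

With these made-up inputs, 40 correlated scores carry about as much information as 16 independent ones, which is why standard errors computed as if n were 40 come out too small.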

The other assumption to keep in mind is that you need to sample at random from normally distributed populations with similar variance.

xaviescacs said:
I think there is nothing wrong with that as long as we are transparent about how this is performed. Readers will judge for themselves.

The p-values this method produces are too small because it underestimates the standard errors. That means that it is likely to produce more false positive results than you anticipated.

Here is some background reading:

xaviescacs · Aug 26, 2021

Thanks for your comments.

Semla said:
These data are not independent, they are nested within listener.

The fact that there is a segment doesn't mean that it's meaningful. What if they added the gender or the age of the listener? Would that make the measurements dependent on the age and gender? Independence means that the next outcome of the random variable does not depend on the previous one. If we take only the speaker and the score, this is a legitimate set of samples.

Semla said:
The p-values this method produces are too small because it underestimates the standard errors. That means that it is likely to produce more false positive results than you anticipated.

I obtain the following p-values:

JBL Control X vs Revel W553L : p-value ~ 0.003
Revel W553L vs OSD 650 : p-value ~ 0.002

Of course the results are a bit different than yours, because the assumptions are different.

Semla said:
The other assumption to keep in mind is that you need to sample at random from normally distributed populations with similar variance.

This is true. In fact, I'm applying the CLT because 40 is a big enough number and therefore I don't need the data to come from a normal distribution, and therefore I don't care about the t distribution. The use of the t.test is just a convention, as the Student's t with 40 observations is like a normal in practice.

Semla said:
Keep running enough of these and the overall risk of finding at least one false positive becomes quite high.

I'm not following. Each test gives the same result every time, of course. What do you mean?

As a starting point, I have enough statistics background to discard Wikipedia as background reading. I don't rule out that you can convince me otherwise, though.

dualazmak · Aug 26, 2021

Just for reference, if we could have some "golden standard(s)", we may evaluate the reliability of the "test" also by sensitivity/specificity analysis; in this kind of audio ABX test, however, the concept of sensitivity/specificity would not be directly applicable, right? (I have been in the field of medical imaging diagnosis R&D for long years.)

xaviescacs · Aug 26, 2021

dualazmak said:
Just for reference, if we could have some "golden standard(s)", we may evaluate the reliability of the "test" also by sensitivity/specificity analysis; in this kind of audio ABX test, however, the concept of sensitivity/specificity would not be directly applicable, right? (I have been in the field of medical imaging diagnosis R&D for long years.)

It's interesting how statistics has different flavors depending on the field of application. I was not familiar with those quantities, although that doesn't mean much as I don't have that much experience.

To clarify, this is not ABX testing we are talking about in this thread. Here we just have some scores of 4 different speakers with some segments like the song and the listener and we can look at several things.

An ABX test consists of a Bernoulli experiment, in which the outcome of each trial is A or B, corresponding to a subject stating that the sound sample he is hearing corresponds to A or B. If the subject gives the right answer in 9 out of 10 trials, we can say that the probability that he has been able to differentiate the two sound samples by chance is low enough that we can state that he is able to tell them apart. We know beforehand how many times the subject has to get it right, because the level of the test does not depend on the data but on the definition of the test.
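The 9-out-of-10 example can be checked with a short R sketch of the one-sided binomial calculation. It assumes 10 independent trials, a guessing probability of 0.5, and a 0.05 test level; those values are taken from the example above or chosen for illustration.

```r
n_trials <- 10
p_guess  <- 0.5    # chance of a correct answer when purely guessing
level    <- 0.05   # assumed test level

# Probability of 9 or more correct answers out of 10 by guessing alone.
p_chance <- pbinom(8, size = n_trials, prob = p_guess, lower.tail = FALSE)

# Smallest number of correct answers whose tail probability is <= level.
ks <- 0:n_trials
tail_prob <- pbinom(ks - 1, size = n_trials, prob = p_guess, lower.tail = FALSE)
k_required <- min(ks[tail_prob <= level])

print(c(p_chance = p_chance, k_required = k_required))
```

Under these assumptions the chance of 9 or more correct guesses is 11/1024, roughly 0.011, and 9 correct answers is the smallest count that passes at the 0.05 level (8 correct gives about 0.055).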

The number of false positives accepted is determined by the level of the test. But we have to be precise about what we are testing. The null hypothesis of those t.tests, whether they come from a simple approach or from a linear model, is that the sample means obtained come from the same distribution and the differences are due to chance. A low p-value means that we can reject this null hypothesis knowing that the probability of a false positive is equal to the p-value.

The approach of @Semla is more ambitious as it builds a model to predict future scores.

The discussion with @Semla is more about the assumptions that can be made or not, rather than in the nature or reliability of the tests themselves. He/she proposes a linear model with dummy variables, because it accounts for all relations between the different factors, which is indeed very reasonable.

Please, correct me if I'm wrong.

Edit: I forgot to answer whether the coefficients you mention can be applied here.

The answer is no, because they are based on real observations after the experiment, measuring all four possible outcomes and calculating some ratios. But that doesn't make sense in this context, as there aren't outcomes since there is no treatment effect. It's a different setting.
