
Four Speaker Blind Listening Test Results (Kef, JBL, Revel, OSD)

dualazmak

Major Contributor
Forum Donor
Joined
Feb 29, 2020
Messages
1,009
Likes
1,059
Location
Ichihara City, Chiba Prefecture, Japan
It's interesting how statistics has different flavors depending on the field of application. I was not familiar with those quantities, although that doesn't mean much, as I don't have that much experience.
....
....
The approach of @Semla is more ambitious as it builds a model to predict future scores.
The discussion with @Semla is more about the assumptions that can or cannot be made, rather than about the nature or reliability of the tests themselves. He/she proposes a linear model with dummy variables because it accounts for all relations between the different factors, which is indeed very reasonable.
Please correct me if I'm wrong.
Edit: I forgot to answer whether the coefficients you point to can be applied here.
The answer is no, because they are based on real observations after the experiment, measuring all four possible outcomes and calculating some ratios. That doesn't make sense in this context, as there are no outcomes, since there is no treatment effect. It's a different setting.

Thank you so much for your kind response. I also feel and fully agree with you that "It's interesting how statistics has different flavors depending on the field of application." I am just curious about how we can objectively assure or measure the reliability of "the test" we are discussing...

In medical diagnostic R&D, if we could get the results shown in my example diagram above, "the test" should be reasonably reliable because of the sensitivity = 0.917 and specificity = 0.750 given by the 200 cases analyzed in total. Usually we analyze more than 1,000 cases, sometimes more than 10,000, for sensitivity and specificity discussions.

For the diagnosis of very rare diseases (or disorders, abnormalities), however, we inevitably encounter a limited number of sample/test cases, e.g. fewer than 200. Even in that kind of limited situation, if sensitivity/specificity exceed 0.75 (or 0.80), we may assume "the test" is relatively reliable. "The new test", which has significantly better cost effectiveness than the very expensive "gold standard" procedures, would consequently be accepted (approved) for routine clinical use as a screening test before the "gold standard" test. In this situation, we should be careful to send as many potential "false negative" cases (i.e. highly suspicious cases, e.g. suspected cancer) as possible forward into the expensive "gold standard" test procedure(s). In other words, in disease diagnosis, false positive cases are rather acceptable (they proceed to further expensive precision diagnosis), but false negative cases (like dismissing a real existing cancer) should be minimized.
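As a concrete illustration, here is a minimal sketch with hypothetical confusion-matrix counts chosen only so that they reproduce the figures quoted above (sensitivity 0.917, specificity 0.750, 200 cases total); they are not real clinical data:

```python
# Hypothetical confusion-matrix counts, chosen only to reproduce the
# quoted figures (sensitivity 0.917, specificity 0.750, n = 200).
tp, fn = 110, 10   # diseased cases: detected / missed
tn, fp = 60, 20    # healthy cases: cleared / flagged

sensitivity = tp / (tp + fn)   # share of real cases the test catches
specificity = tn / (tn + fp)   # share of healthy cases the test clears

print(round(sensitivity, 3))   # 0.917
print(round(specificity, 3))   # 0.75
print(tp + fn + tn + fp)       # 200
```

With counts this small, the same ratios could arise from quite different underlying tests, which is exactly the small-sample reliability concern raised here.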

I know that comparative audio listening tests also always encounter limited sample/case numbers for statistical analyses. I would like to have/know a simple and easy-to-understand objective "reliability Merkmal" (criterion) for such comparative audio listening tests with rather limited sample numbers. I assume your and @Semla's discussions in this thread almost sufficiently answer my concerns.
 
Last edited:

xaviescacs

Active Member
Forum Donor
Joined
Mar 23, 2021
Messages
241
Likes
187
Location
Barcelona
Thank you so much for your kind response. I also feel and fully agree with you that "It's interesting how statistics has different flavors depending on the field of application." I am just curious about how we can objectively assure or measure the reliability of "the test" we are discussing...

In medical diagnostic R&D, if we could get the results shown in my example diagram above, "the test" should be reasonably reliable because of the sensitivity = 0.917 and specificity = 0.750 given by the 200 cases analyzed in total. Usually we analyze more than 1,000 cases, sometimes more than 10,000, for sensitivity and specificity discussions.

For the diagnosis of very rare diseases (or disorders, abnormalities), however, we inevitably encounter a limited number of sample/test cases, e.g. fewer than 200. Even in that kind of limited situation, if sensitivity/specificity exceed 0.75 (or 0.80), we may assume "the test" is relatively reliable. "The new test", which has significantly better cost effectiveness than the very expensive "gold standard" procedures, would consequently be accepted (approved) for routine clinical use as a screening test before the "gold standard" test. In this situation, we should be careful to send as many potential "false negative" cases (i.e. highly suspicious cases, e.g. suspected cancer) as possible forward into the expensive "gold standard" test procedure(s). In other words, in disease diagnosis, false positive cases are rather acceptable (they proceed to further expensive precision diagnosis), but false negative cases (like dismissing a real existing cancer) should be minimized.

I know that comparative audio listening tests also always encounter limited sample/case numbers for statistical analyses. I would like to have/know a simple and easy-to-understand objective "reliability Merkmal" (criterion) for such comparative audio listening tests with rather limited sample numbers. I assume your and @Semla's discussions in this thread almost sufficiently answer my concerns.

It's a pleasure talking with you.

You are talking about experimental design, that is, knowing how the experiment should be conducted for the results to have enough statistical significance to be accepted.

The key point here is to describe the setting and the goal of the experiment. In the specific case of this thread, we (many people, not just me) are just playing with the data to see what we can get out of it, mainly because the initial post didn't include any statistical analysis. But this is very different from conducting an experiment with the intention of proving (or rejecting) something. I've seen many people here post results of some experiment and then be unable to convince anyone, because the readers don't agree with the experimental design.

If you want to conduct an experiment and present the results here on ASR, the best way in my opinion is to open a thread explaining your goals and tools, and to build the experimental design with the community: what to measure, how many times, etc. This design should include the characteristics of the test, such as significance level and power.
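To give a flavor of what "level and power" buy you at the design stage, here is a sketch of the standard normal-approximation sample-size formula for a two-sample t-test. The effect size of 0.5 SD, alpha of 0.05 and power of 0.8 are placeholder values, not figures from this thread:

```python
from math import ceil
from statistics import NormalDist

# Normal-approximation sample size per group for a two-sample t-test.
# d = 0.5 SD, alpha = 0.05 (two-sided), power = 0.8 are placeholders.
d, alpha, power = 0.5, 0.05, 0.80
z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~ 1.96
z_b = NormalDist().inv_cdf(power)           # ~ 0.84

n_per_group = ceil(2 * ((z_a + z_b) / d) ** 2)
print(n_per_group)   # 63
```

So even a "medium" effect at conventional level and power already asks for roughly 63 listeners (or trials) per speaker, which shows why these small listening panels struggle to settle close comparisons.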

The golden rule you are asking for will depend heavily on the experiment, on the problem at hand so to speak, so there is no general answer. This is what statisticians do (I'm not one of them): design experiments so as to maximize the output (the knowledge that can be obtained from the results) while minimizing the costs. In this thread I don't see exactly what the goal is, so I don't dare to say how the experiment should have been designed.

You won't find what you want in this thread. Again, my suggestion is to make a post on what you imagine the experiment to be; there will surely be very informed people making suggestions that will eventually lead to a proper experimental design.
 

David Harper

Senior Member
Joined
Jul 7, 2019
Messages
302
Likes
353
I suspect that at least some of those attacking these results most vehemently own either the JBLs or the OSDs. But that's just me.
 

Semla

Active Member
Joined
Feb 8, 2021
Messages
169
Likes
303
The fact that there is a segment doesn't mean that it's meaningful. What if they had added the gender or the age of the listener? Would that make the measurements dependent on age and gender? Independence means that the next outcome of the random variable does not depend on the previous one. If we take only the speaker and the score, this is a legitimate set of samples.
There is much more to independence and multilevel models than longitudinal data, for example data obtained from families, clusters, etc. Sometimes people talk about "technical replication" or "pseudoreplication" in this context. This, and how to distinguish between random and fixed effects, is covered in introductory courses on ANOVA/linear models (ANOVAs are linear models).

In fact, I'm applying the CLT because 40 is a big enough number, and therefore I don't need the data to come from a normal distribution, and therefore I don't care about the t distribution. The use of t.test is just a convention, as Student's t with 40 observations is like a normal in practice.
That is not the case in general. What if you sampled from a Cauchy distribution, or from an exponential? Would the CLT apply? Would it resemble a normal distribution with 40 samples? Try it for yourself: qqnorm(rexp(40, 0.001))
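The point can be made concrete with a quick simulation (sketched in Python here rather than R; the sample sizes are arbitrary): raw exponential draws stay strongly right-skewed even at n = 40, while the distribution of their sample means is far closer to symmetric, with skewness shrinking roughly like 2/sqrt(n):

```python
import random
from statistics import mean

random.seed(1)

def skewness(xs):
    # Sample skewness: third central moment over variance^(3/2).
    m = mean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Raw exponential draws vs. means of 40 draws each.
raw = [random.expovariate(0.001) for _ in range(100_000)]
means = [mean(random.expovariate(0.001) for _ in range(40))
         for _ in range(10_000)]

print(round(skewness(raw), 2))    # close to 2, the exponential's skewness
print(round(skewness(means), 2))  # much smaller, roughly 2 / sqrt(40)
```

Whether the residual skew of the means is negligible at n = 40 is exactly the judgment call being debated here.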

I'm not following. Each test gives the same result every time, of course. What do you mean?
Names like Bonferroni, Šidák and Tukey should ring a bell.
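For example, comparing four speakers pairwise already means six tests, and a Bonferroni correction compares each p-value against alpha divided by the number of tests (equivalently, multiplies each p by that number). A sketch with invented p-values:

```python
# Bonferroni correction for multiple comparisons: with m tests, compare
# each p-value against alpha / m (equivalently, multiply p by m).
# The six p-values below are invented for illustration only.
p_values = [0.004, 0.020, 0.030, 0.110, 0.450, 0.800]
m = len(p_values)   # six pairwise tests among four speakers
alpha = 0.05

adjusted = [round(min(1.0, p * m), 3) for p in p_values]
significant = [p for p, adj in zip(p_values, adjusted) if adj < alpha]
print(adjusted)      # [0.024, 0.12, 0.18, 0.66, 1.0, 1.0]
print(significant)   # [0.004] -- only one result survives the correction
```

Note how results that look "significant" in isolation (p = 0.02, p = 0.03) no longer clear the corrected threshold, which is the danger of running a bunch of unpaired t-tests.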

As a starting point, I have enough statistics background to discard the Wikipedia as a background.
Those Wikipedia articles are actually nice starting points... Add the sensitivity and specificity page if you ever want to do predictive modelling or classification.

Using statistical software is easy, but some theoretical knowledge is required to do it in a sensible way. I've tried to show why doing a bunch of unpaired t-tests is not a good idea, but really, these issues would have been covered in an intro to stats course. I can only encourage you to deepen your knowledge/revisit your course notes and will stop debating this. It's become off-topic and too much like work.
 
Last edited:

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,060
Likes
1,196
Location
California
Thank you so much for your kind response. I also feel and fully agree with you that "It's interesting how statistics has different flavors depending on the field of application." I am just curious about how we can objectively assure or measure the reliability of "the test" we are discussing...

In medical diagnostic R&D, if we could get the results shown in my example diagram above, "the test" should be reasonably reliable because of the sensitivity = 0.917 and specificity = 0.750 given by the 200 cases analyzed in total. Usually we analyze more than 1,000 cases, sometimes more than 10,000, for sensitivity and specificity discussions.

For the diagnosis of very rare diseases (or disorders, abnormalities), however, we inevitably encounter a limited number of sample/test cases, e.g. fewer than 200. Even in that kind of limited situation, if sensitivity/specificity exceed 0.75 (or 0.80), we may assume "the test" is relatively reliable. "The new test", which has significantly better cost effectiveness than the very expensive "gold standard" procedures, would consequently be accepted (approved) for routine clinical use as a screening test before the "gold standard" test. In this situation, we should be careful to send as many potential "false negative" cases (i.e. highly suspicious cases, e.g. suspected cancer) as possible forward into the expensive "gold standard" test procedure(s). In other words, in disease diagnosis, false positive cases are rather acceptable (they proceed to further expensive precision diagnosis), but false negative cases (like dismissing a real existing cancer) should be minimized.

I know that comparative audio listening tests also always encounter limited sample/case numbers for statistical analyses. I would like to have/know a simple and easy-to-understand objective "reliability Merkmal" (criterion) for such comparative audio listening tests with rather limited sample numbers. I assume your and @Semla's discussions in this thread almost sufficiently answer my concerns.

sensitivity and specificity are not applicable to his particular experiment, like not even a little bit.
 

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,060
Likes
1,196
Location
California
@preload, was hoping someone would come forward with the source of the 13-bookshelf-speaker blind study with the 0.99 correlation factor, but no one has. I've only very recently started keeping track of Toole/Olive data in an organized manner, but I didn't do that while reading his book or the huge 100+ pages on AVS. I'm 80% sure there are references to the study in the "How to choose a loudspeaker, what the science shows" thread, but it may take a while to reread that.
I don't have time to look back at the paper, but I'm pretty sure the r = 0.99 was only for the initial preliminary set of 13 loudspeakers that was used as a proof of concept for developing a more generalizable regression formula. The correlation between predicted and actual preference scores from their best regression, on the more representative sample of 70 loudspeakers, is the lower and commonly cited 0.86. Hope that helps.
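For reference, the figure being quoted is a Pearson correlation between predicted and measured preference scores. A self-contained sketch of that computation, using invented scores (not Olive's data):

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    # Pearson correlation: sample covariance over the product of
    # sample standard deviations (both use the n - 1 denominator).
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (stdev(xs) * stdev(ys))

# Invented predicted vs. listener preference scores, illustration only.
predicted = [4.1, 5.0, 5.8, 6.3, 7.2]
actual    = [4.0, 5.2, 5.5, 6.6, 7.0]
print(round(pearson_r(predicted, actual), 2))   # 0.98 for these made-up numbers
```

A high r on the small set used to fit the model (like the 13-speaker set) is expected to shrink when the model is applied to a broader sample, which is consistent with the drop from 0.99 to 0.86.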
 
Last edited:

xaviescacs

Active Member
Forum Donor
Joined
Mar 23, 2021
Messages
241
Likes
187
Location
Barcelona
I agree this has gone too far. It's difficult to know when to stop. However, I just can't resist commenting on this. Excuse me for that.

That is not the case in general. What if you sampled from a Cauchy distribution, or from an exponential? Would the CLT apply? Would it resemble a normal distribution with 40 samples? Try it for yourself: qqnorm(rexp(40, 0.001))

The CLT is about the sample mean; where is the sample mean in your code (with qqnorm)? The CLT states that the sample mean of ANY independently drawn data, regardless of its distribution, approximately follows a normal distribution. This is why I can assume, having 40 points (they say the magic number is about 30, right?), that the sample mean for each speaker follows a normal distribution. Not the data itself, but the sample mean as a random variable. Namely, the sample mean I calculate for each speaker is a realization of this normal random variable. This further allows me to apply a t-test quite confidently.

I've done a simulation of this in my third post. Let me try with your exponential suggestion.

Code:
# Distribution of the sample mean of 40 exponential draws
N <- 100000
sim_data <- replicate(N, mean(rexp(40, 0.001)))
qqnorm(sim_data)

[Attachment: qqnorm.png — normal Q-Q plot of the simulated sample means]


Code:
hist(sim_data)

[Attachment: hist.png — histogram of the simulated sample means]


Code:
library(dplyr)    # provides tibble() and union()
library(ggplot2)

# Overlay the density of the simulated means on a normal distribution
# with the same mean and standard deviation
tibble(
  x = sim_data,
  origin = "exp_sim"
) %>%
  union(
    tibble(
      x = rnorm(n = N, mean(sim_data), sd(sim_data)),
      origin = "rnorm"
    )
  ) %>%
  ggplot(aes(x = x, fill = origin)) +
  geom_density(alpha = 0.2)

[Attachment: density.png — density overlay of the simulated means vs. a matching normal]


This is pretty basic statistics that is covered in any introductory statistics course. If you want some theoretical background you can take a look at Wasserman, L., All of Statistics, page 77, or Casella, G. and Berger, R., Statistical Inference (Second edition), page 236.

It's very easy to take a course in applied statistics (there are tons of them) and start running linear models accounting for everything without understanding what's under the hood.

Looks like I'm not the only one who needs to revisit his course notes, right? ;)

Now seriously. I know you understand the CLT and its implications. But please, stop conjecturing about what other people know, understand or should revisit. It's just not your business, and I don't think it's of interest to anyone in this forum. There are very informed readers here, more than capable of drawing their own conclusions without wasting their time on such considerations. If you think some analysis is not correct, explain the reasons, keep the discussion on the subject and the concepts, let the readers judge for themselves, and leave the poor fool who wrote it in peace.

Thanks for your time and patience anyway. See you in some other thread. And don't accuse me of stealing your time; I didn't quote you, and you started this discussion.
 
OP

MatthewS

Member
Forum Donor
Joined
Jul 31, 2020
Messages
39
Likes
244
Location
Greater Seattle
It looks like @Inverse_Laplace and I will be conducting a second blind listening test. It'll be on 3/28/22 and/or 3/29/22, so we've got some time to plan.

We'd like to get feedback on improvements or changes for round two. Here are a few items so far:

  • ITU-R BS.1770 loudness instead of C weighting
  • Significantly larger listening room
  • Better positioning of speakers (no table)
  • Attempt to bring in more listeners that skew trained
  • Only similar speakers (likely all powered bookshelf/monitors)
  • Take room measurements of each speaker at prime listening position
  • Possibly a set of scores with frequencies below the transition frequency rolled off (just seems like it might be interesting)
We are not planning to rotate speakers into an identical position. Our plan would be to only include speakers that have spinorama data or could be measured by @amirm at some point.

I own two powered speakers: Kef LSX and Vanatoo T0.

We're open to suggestions for speakers; we will probably limit the pool to 4 total. We don't have to stick with powered, but I find powered speakers more interesting than passive. If a speaker isn't cheap, someone will likely need to help provide a sample to be returned after the test. @Rick Sykora, any interest in including Directiva in this?

@amirm, since you're more or less local, we'd love to have you participate if you're interested. @GaryG?

@Semla, interested in helping with prep and post statistical work?
 

TLEDDY

Senior Member
Forum Donor
Joined
Aug 4, 2019
Messages
411
Likes
413
Location
Central Florida
MatthewS, you are clearly a masochist! That said, I and others on this forum greatly appreciate the effort you are putting in!

I would like to support your effort. I could PayPal a contribution; PM me information so I may help out.

Tillman
 

Semla

Active Member
Joined
Feb 8, 2021
Messages
169
Likes
303
It looks like @Inverse_Laplace and I will be conducting a second blind listening test. It'll be on 3/28/22 and/or 3/29/22, so we've got some time to plan.

We'd like to get feedback on improvements or changes for round two. Here are a few items so far:

  • ITU-R BS.1770 loudness instead of C weighting
  • Significantly larger listening room
  • Better positioning of speakers (no table)
  • Attempt to bring in more listeners that skew trained
  • Only similar speakers (likely all powered bookshelf/monitors)
  • Take room measurements of each speaker at prime listening position
  • Possibly a set of scores with frequencies below the transition frequency rolled off (just seems like it might be interesting)
We are not planning to rotate speakers into an identical position. Our plan would be to only include speakers that have spinorama data or could be measured by @amirm at some point.

I own two powered speakers: Kef LSX and Vanatoo T0.

We're open to suggestions for speakers; we will probably limit the pool to 4 total. We don't have to stick with powered, but I find powered speakers more interesting than passive. If a speaker isn't cheap, someone will likely need to help provide a sample to be returned after the test. @Rick Sykora, any interest in including Directiva in this?

@amirm, since you're more or less local, we'd love to have you participate if you're interested. @GaryG?

@Semla, interested in helping with prep and post statistical work?
Happy to help out!
 

Rick Sykora

Major Contributor
Forum Donor
Joined
Jan 14, 2020
Messages
1,588
Likes
2,565
Location
Stow, Ohio USA
It looks like @Inverse_Laplace and I will be conducting a second blind listening test. It'll be on 3/28/22 and/or 3/29/22, so we've got some time to plan.

We'd like to get feedback on improvements or changes for round two. Here are a few items so far:

  • ITU-R BS.1770 loudness instead of C weighting
  • Significantly larger listening room
  • Better positioning of speakers (no table)
  • Attempt to bring in more listeners that skew trained
  • Only similar speakers (likely all powered bookshelf/monitors)
  • Take room measurements of each speaker at prime listening position
  • Possibly a set of scores with frequencies below the transition frequency rolled off (just seems like it might be interesting)
We are not planning to rotate speakers into an identical position. Our plan would be to only include speakers that have spinorama data or could be measured by @amirm at some point.

I own two powered speakers: Kef LSX and Vanatoo T0.

We're open to suggestions for speakers; we will probably limit the pool to 4 total. We don't have to stick with powered, but I find powered speakers more interesting than passive. If a speaker isn't cheap, someone will likely need to help provide a sample to be returned after the test. @Rick Sykora, any interest in including Directiva in this?

@amirm, since you're more or less local, we'd love to have you participate if you're interested. @GaryG?

@Semla, interested in helping with prep and post statistical work?
While I appreciate the effort and the offer, I will need to know more about the test conditions. For example, depending on its location in a large room, r1 may need different bass tuning than it currently has. If the room is much larger than 50 ft³, then the higher output of r2 may be needed, and I'm not sure it will be ready in time. Thanks!
 

Spocko

Addicted to Fun and Learning
Forum Donor
Joined
Sep 27, 2019
Messages
852
Likes
1,574
Location
Southern California
A few questions:
  • How far will the listening position be from the plane of the speakers?
  • If you have active speakers with DSP room correction, will it be enabled? If so, should you run both "enabled" and "disabled" listening tests with these room-corrected speakers, just so it's more oranges-to-oranges when the other speakers do not have room correction?
  • Do you plan to have a wide selection of music/content with the intent to uncover speaker weaknesses or will the music selection be what listeners want to hear?
    • Will reviewers be given enough time to familiarize themselves with the music selection?
I'm loving this project by the way and hoping some new cool active speakers will be available for purchase by the time your test rolls around.
 