
Four Speaker Blind Listening Test Results (Kef, JBL, Revel, OSD)

dualazmak

Major Contributor
Forum Donor
Joined
Feb 29, 2020
Messages
2,820
Likes
2,950
Location
Ichihara City, Chiba Prefecture, Japan
It's interesting how statistics has different flavors depending on the field of application. I was not familiar with those quantities, although that doesn't mean much, as I don't have that much experience.
....
....
The approach of @Semla is more ambitious as it builds a model to predict future scores.
The discussion with @Semla is more about the assumptions that can or cannot be made, rather than about the nature or reliability of the tests themselves. He/she proposes a linear model with dummy variables because it accounts for all relations between the different factors, which is indeed very reasonable.
Please, correct me if I'm wrong.
Edit: I forgot to answer whether the coefficients you point to can be applied here.
The answer is no, because they are based on real observations after the experiment, measuring all four possible outcomes and calculating some ratios. But that doesn't make sense in this context, as there are no such outcomes when there is no treatment effect. It's a different setting.

Thank you so much for your kind response. I fully agree with you that "It's interesting how statistics has different flavors depending on the field of application." I am just curious how we can objectively assure or measure the reliability of "the test" we are discussing...

In medical diagnostic R&D, if we obtained the results shown in my example diagram above, "the test" would be considered reasonably reliable, given the sensitivity = 0.917 and specificity = 0.750 from the 200 cases analyzed. Usually we analyze more than 1,000 cases, sometimes more than 10,000, for sensitivity and specificity discussions.

For the diagnosis of very rare diseases (or disorders, abnormalities), however, we inevitably face a limited number of sample/test cases, e.g. fewer than 200. Even in that kind of limited situation, if sensitivity/specificity exceed 0.75 (or 0.80), we may assume "the test" is relatively reliable. A new test with significantly better cost-effectiveness than the very expensive "gold standard" procedure would consequently be accepted (approved) for routine clinical use as a screening test performed before the gold-standard test. In this situation, we should take care that as many potential "false negative" cases as possible (i.e. cases with a high suspicion of disease, e.g. cancer) still go forward into the expensive gold-standard procedure(s). In other words, in disease diagnosis false positives are rather acceptable (they merely lead to further, expensive precision diagnostics), but false negatives (dismissing a real cancer) must be minimized.
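For what it's worth, the sensitivity/specificity figures above can be reproduced from a simple 2x2 table. A minimal Python sketch; the counts are hypothetical, chosen only to match the stated ratios (0.917, 0.750, 200 cases), not the actual study data:

```python
# Hypothetical 2x2 confusion matrix. Counts are illustrative only, chosen to
# reproduce the ratios quoted above; they are not real study data.
TP, FN = 110, 10   # diseased cases: test positive / test negative
FP, TN = 20, 60    # healthy cases: test positive / test negative

sensitivity = TP / (TP + FN)   # P(test positive | disease) = 110/120
specificity = TN / (TN + FP)   # P(test negative | healthy) = 60/80

print(f"n={TP + FN + FP + TN}  sensitivity={sensitivity:.3f}  specificity={specificity:.3f}")
```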

I know that comparative audio listening tests also always face limited sample/case numbers for statistical analysis. I would like a simple, easy-to-understand, objective reliability criterion ("Merkmal") for such comparative audio listening tests with rather limited sample numbers. I assume your and @Semla's discussions in this thread come close to answering my concerns.
 
Last edited:

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,494
Likes
1,971
Location
La Garriga, Barcelona
Thank you so much for your kind response. I fully agree with you that "It's interesting how statistics has different flavors depending on the field of application." I am just curious how we can objectively assure or measure the reliability of "the test" we are discussing...

In medical diagnostic R&D, if we obtained the results shown in my example diagram above, "the test" would be considered reasonably reliable, given the sensitivity = 0.917 and specificity = 0.750 from the 200 cases analyzed. Usually we analyze more than 1,000 cases, sometimes more than 10,000, for sensitivity and specificity discussions.

For the diagnosis of very rare diseases (or disorders, abnormalities), however, we inevitably face a limited number of sample/test cases, e.g. fewer than 200. Even in that kind of limited situation, if sensitivity/specificity exceed 0.75 (or 0.80), we may assume "the test" is relatively reliable. A new test with significantly better cost-effectiveness than the very expensive "gold standard" procedure would consequently be accepted (approved) for routine clinical use as a screening test performed before the gold-standard test. In this situation, we should take care that as many potential "false negative" cases as possible (i.e. cases with a high suspicion of disease, e.g. cancer) still go forward into the expensive gold-standard procedure(s). In other words, in disease diagnosis false positives are rather acceptable (they merely lead to further, expensive precision diagnostics), but false negatives (dismissing a real cancer) must be minimized.

I know that comparative audio listening tests also always face limited sample/case numbers for statistical analysis. I would like a simple, easy-to-understand, objective reliability criterion ("Merkmal") for such comparative audio listening tests with rather limited sample numbers. I assume your and @Semla's discussions in this thread come close to answering my concerns.

It's a pleasure talking with you.

You are talking about experimental design, that is, knowing how the experiment should be conducted for the results to have enough statistical significance to be accepted.

The key point here is to describe the setting and the goal of the experiment. In the specific case of this thread, we (many people, not just me) are just playing with the data to see what we can get out of it, mainly because the initial post didn't include any statistical analysis. But this is very different from conducting an experiment with the intention of proving (or rejecting) something. I've seen many people here post results of some experiment and then be unable to convince anyone, because the readers don't agree on the experimental design.

If you want to conduct an experiment and present the results here on ASR, the best way in my opinion is to open a thread explaining your goals and tools, and to build the experimental design with the community: what to measure, how many times, etc. This design should include the characteristics of the test: significance level and power.

The golden rule you are asking for depends heavily on the experiment, on the problem at hand, so there is no general answer. This is what statisticians do (I'm not one of them): work out an experimental design that maximizes the output (the knowledge obtainable from the results) while minimizing the experiment's cost. In this thread I don't see exactly what the goal is, so I won't presume to say how the experiment should have been designed.
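To make the "level and power" part concrete: here is a back-of-the-envelope sample-size sketch for comparing two speakers' mean scores, using the normal approximation (z = 1.96 and z = 0.8416 are the standard constants for a two-sided alpha = 0.05 and 80% power). The effect sizes are hypothetical, not from this thread:

```python
import math

def n_per_group(effect_size, z_alpha=1.96, z_beta=0.8416):
    """Approximate listeners needed per speaker to detect a standardized
    mean difference (Cohen's d) at alpha = 0.05 (two-sided) with 80% power,
    via the normal approximation n = 2 * ((z_alpha + z_beta) / d)^2."""
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Hypothetical effect sizes: a "large" difference is far cheaper to detect
print(n_per_group(0.8))  # large effect  -> 25 per group
print(n_per_group(0.5))  # medium effect -> 63 per group
```

The point being: the required sample size explodes as the expected difference shrinks, which is exactly why the design has to be agreed on before the test.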

You won't find what you want in this thread. Again, my suggestion is to make a post describing the experiment you have in mind; there will surely be very well-informed people making suggestions that eventually lead to a proper experimental design.
 

David Harper

Senior Member
Joined
Jul 7, 2019
Messages
359
Likes
434
I suspect that at least some of those attacking these results most vehemently own either the JBL's or the OSD's. But that's just me.
 

Semla

Active Member
Joined
Feb 8, 2021
Messages
170
Likes
328
The fact that there is a segment doesn't mean that it's meaningful. What if they added the gender or the age of the listener? Would that make the measurements dependent on age and gender? Independence means that the next outcome of the random variable does not depend on the previous one. If we take only the speaker and the score, this is a legitimate set of samples.
There is much more to independence and multilevel models than longitudinal data, for example data obtained from families, clusters, etc. Sometimes people talk about "technical replication" or "pseudoreplication" in this context. This, and how to distinguish between random and fixed effects, is covered in introductory courses on ANOVA/linear models (ANOVAs are linear models).

In fact, I'm applying the CLT because 40 is a big enough number, and therefore I don't need the data to come from a normal distribution and don't care about the t distribution. The use of the t-test is just a convention, as Student's t with 40 observations is practically a normal.
That is not the case in general. What if you sampled from a Cauchy distribution, or from an exponential? Would the CLT apply? Would it resemble a normal distribution with 40 samples? Try for yourself: qqnorm(rexp(40, 0.001))

I'm not following. Each test gives the same result every time, of course. What do you mean?
Names like Bonferroni, Šidák and Tukey should ring a bell.
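To spell out why those names matter here: with four speakers there are six pairwise t-tests, and the per-test significance threshold has to shrink accordingly. A minimal sketch (the speaker names are just labels):

```python
from itertools import combinations

speakers = ["KEF", "JBL", "Revel", "OSD"]
pairs = list(combinations(speakers, 2))
m = len(pairs)                      # 6 pairwise comparisons

alpha = 0.05
bonferroni = alpha / m              # 0.05 / 6     ~ 0.00833
sidak = 1 - (1 - alpha) ** (1 / m)  # 1 - 0.95^(1/6) ~ 0.00851

print(m, f"{bonferroni:.5f}", f"{sidak:.5f}")
```

Running six uncorrected tests at 0.05 each gives roughly a 26% chance of at least one false positive; the corrected thresholds keep the family-wise error rate at 5%.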

As a starting point, I have enough statistics background to discard Wikipedia as a reference.
Those Wikipedia articles are actually nice starting points... Add the sensitivity and specificity page if you ever want to do predictive modelling or classification.

Using statistical software is easy, but some theoretical knowledge is required to do it in a sensible way. I've tried to show why doing a bunch of unpaired t-tests is not a good idea, but really, these issues would have been covered in an intro to stats course. I can only encourage you to deepen your knowledge/revisit your course notes and will stop debating this. It's become off-topic and too much like work.
 
Last edited:

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,554
Likes
1,701
Location
California
Thank you so much for your kind response. I also feel and fully agree with you that "It's interesting how statistics has different flavors depending on the field of appliance." I am just curious about how we can objectively assure or measure the reliability of "the test" we are discussing...

In the medical diagnostic R&D, if we could get the results shown in my above example diagram, "the test" should be reasonably reliable because of the sensitivity=0.917 and the specificity=0.750 given by the total 200 cases analyzed. Usually, we analyze more than 1,000 cases, sometimes more than 10,000 cases for sensitivity and specificity discussions.

For the diagnosis of very rare diseases (or disorders, abnormalities), however, we inevitably encounter the limitation of the sample/test cases, e.g. less than 200 cases. Even in that kind of limited situation, if sensitivity/specificity exceed 0.75 (or 0.80), we may assume "the test" would be relatively reliable. "The new test" which has significantly better cost effectiveness compared to the very expensive "golden standard" procedures, would be consequently accepted (approved) for routine clinical utilization as a screening test before the "golden standard" test. In this situation, we should be careful enough in including as much as possible "false negative" cases (or highly disease, e.g. cancer, suspicious cases) to go forward into the expensive "golden standard" test procedure(s). In other words, in the case of disease diagnosis, false positive cases would be rather accepted (for further expensive precision diagnosis), but false negative cases (like dismissing real true cancer existence) should be minimized.

I know that the comparative audio listening tests also always encounter the limitation of sample/case numbers for statistical analyses. I would like to have/know simple and easy-to-understand objective "reliability Merkmal" for such a comparative audio listening tests of rather limited sample numbers. I assume your and @Semla's discussions in this thread would be almost sufficiently answering my concerns.

sensitivity and specificity are not applicable to his particular experiment, like not even a little bit.
 

preload

Major Contributor
Forum Donor
Joined
May 19, 2020
Messages
1,554
Likes
1,701
Location
California
@preload , was hoping someone would come forward with the source of the 13-bookshelf-speaker blind study with the 0.99 correlation factor, but no one has. I’ve only very recently started keeping track of Toole/Olive data in an organized manner, but I didn’t do that while reading his book or the huge 100+ pages on AVS. I’m 80% sure there are references to the study in the “How to choose a loudspeaker, what the science shows” thread, but it may take a while to reread that.
I don’t have time to look back at the paper, but I’m pretty sure the r = 0.99 was only for the initial preliminary set of 13 loudspeakers used as a proof of concept for the development of a more generalizable regression formula. The actual correlation between predicted and actual preference scores from their best regression, in the more representative sample of 70 loudspeakers, is the lower and commonly cited 0.86. Hope that helps.
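For anyone who wants to check such correlation figures against a published score table, Pearson's r is simple to compute directly. The ratings below are made-up illustrative numbers, not Olive's data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical predicted vs. observed preference ratings (illustration only)
predicted = [4.2, 5.1, 6.3, 7.0, 5.6]
observed = [4.0, 5.4, 6.1, 7.2, 5.5]
print(round(pearson_r(predicted, observed), 2))
```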
 
Last edited:

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,494
Likes
1,971
Location
La Garriga, Barcelona
I agree this has gone too far. It's difficult to know when to stop. However, I just can't resist commenting on this. Excuse me for that.

That is not the case in general. What if you sampled from a Cauchy distribution, or from an exponential? Would the CLT apply? Would it resemble a normal distribution with 40 samples? Try for yourself: qqnorm(rexp(40, 0.001))

The CLT is about the sample mean: where is the sample mean in your code (with qqnorm)? The CLT states that the sample mean of ANY independently drawn data, regardless of its distribution, approximately follows a normal distribution. This is why I can assume, having 40 points (they say the magic number is about 30, right?), that the sample mean for each speaker follows a normal distribution. Not the data itself, but the sample mean as a random variable: the sample mean I calculate for each speaker is a realization of this normal random variable. This further allows me to use a t-test quite confidently.

I've done a simulation of this in my third post. Let me try with your exponential suggestion.

Code:
N <- 100000  # number of simulated sample means
sim_data <- replicate(N, {
  mean(rexp(40, 0.001))  # sample mean of 40 exponential draws
})
qqnorm(sim_data)  # points fall on a straight line if approximately normal

qqnorm.png


Code:
hist(sim_data)

hist.png


Code:
library(dplyr)    # tibble(), union(), %>%
library(ggplot2)  # ggplot(), geom_density()

tibble(
  x = sim_data,
  origin = "exp_sim"
) %>%
  union(
    tibble(
      x = rnorm(n = N, mean(sim_data), sd(sim_data)),
      origin = "rnorm"
    )
  ) %>%
  ggplot(aes(x = x, fill = origin)) +
  geom_density(alpha = 0.2)

density.png


This is pretty basic statistics that is covered in any introductory statistics course. If you want some theoretical background you can take a look at Wasserman, L., All of Statistics, page 77, or Casella, G. and Berger, R., Statistical Inference (Second edition), page 236.

It's very easy to take a course in applied statistics, there are tons of them, and start running linear models accounting for everything without understanding what's under the hood.

Looks like I'm not the only one who needs to revisit his course notes, right? ;)

Now seriously. I know you understand the CLT and its implications. But please, stop conjecturing about what other people know, understand, or should revisit. It's just not your business, and I don't think it's of interest to anyone in this forum. There are very well-informed readers here, more than capable of drawing their own conclusions without losing time on such considerations. If you think some analysis is not correct, explain the reasons and keep the discussion on the subject and concepts; let the readers judge for themselves and leave the poor fool who wrote it in peace.

Thanks for your time and patience anyway. See you in some other thread. And don't accuse me of stealing your time; I didn't quote you, and you started this discussion.
 
OP

MatthewS

Member
Forum Donor
Joined
Jul 31, 2020
Messages
94
Likes
859
Location
Greater Seattle
It looks like @Inverse_Laplace and I will be conducting a second blind listening test. It'll be on 3/28/22 and/or 3/29/22, so we've got some time to plan.

We'd like to get feedback on improvements or changes for round two. Here are a few items so far:

  • ITU-R BS.1770 loudness instead of C weighting
  • Significantly larger listening room
  • Better positioning of speakers (no table)
  • Attempt to bring in more listeners that skew trained
  • Only similar speakers (likely all powered bookshelf/monitors)
  • Take room measurements of each speaker at prime listening position
  • Possibly a set of scores with frequencies below the transition frequency rolled off (just seems like it might be interesting)
We are not planning to rotate speakers into an identical position. Our plan would be to only include speakers that have spinorama data or could be measured by @amirm at some point.
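On the first bullet in the list above: a real BS.1770 measurement involves K-weighting and gating (a loudness-meter library would handle that), but once each speaker's integrated loudness is known, matching levels is simple arithmetic. A sketch, where the -23 LUFS target is an assumed broadcast-style reference rather than anything decided in this thread:

```python
def match_gain(measured_lufs, target_lufs=-23.0):
    """Linear gain factor that brings a speaker's measured integrated
    loudness (LUFS, e.g. from an ITU-R BS.1770 meter) to a common target.
    The -23 LUFS default is an assumed reference, not from the thread."""
    return 10 ** ((target_lufs - measured_lufs) / 20)

# A speaker measuring -20 LUFS is 3 LU hot: attenuate by ~0.708x (-3 dB)
print(round(match_gain(-20.0), 3))
```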

I own two powered speakers: Kef LSX and Vanatoo T0.

We're open to suggestions for speakers, we will probably limit the pool to 4 total. We don't have to stick with powered--but I find powered speakers more interesting than passive. If the speaker isn't cheap--someone will likely need to help provide a sample to be returned after the test. @Rick Sykora, any interest in including Directiva in this?

@amirm, since you're more or less local, we'd love to have you participate if you're interested. @GaryG?

@Semla, interested in helping with prep and post statistical work?
 

TLEDDY

Addicted to Fun and Learning
Forum Donor
Joined
Aug 4, 2019
Messages
631
Likes
858
Location
Central Florida
MatthewS - you are clearly a masochist! That said, I and others on this Forum greatly appreciate the effort you are extending!

I would like to support your effort. I could PayPal a contribution- PM me information so I may help out.

Tillman
 

Semla

Active Member
Joined
Feb 8, 2021
Messages
170
Likes
328
It looks like @Inverse_Laplace and I will be conducting a second blind listening test. It'll be on 3/28/22 and/or 3/29/22, so we've got some time to plan.

We'd like to get feedback on improvements or changes for round two. Here are a few items so far:

  • ITU-R BS.1770 loudness instead of C weighting
  • Significantly larger listening room
  • Better positioning of speakers (no table)
  • Attempt to bring in more listeners that skew trained
  • Only similar speakers (likely all powered bookshelf/monitors)
  • Take room measurements of each speaker at prime listening position
  • Possibly a set of scores with frequencies below the transition frequency rolled off (just seems like it might be interesting)
We are not planning to rotate speakers into an identical position. Our plan would be to only include speakers that have spinorama data or could be measured by @amirm at some point.

I own two powered speakers: Kef LSX and Vanatoo T0.

We're open to suggestions for speakers, we will probably limit the pool to 4 total. We don't have to stick with powered--but I find powered speakers more interesting than passive. If the speaker isn't cheap--someone will likely need to help provide a sample to be returned after the test. @Rick Sykora, any interest in including Directiva in this?

@amirm, since you're more or less local, we'd love to have you participate if you're interested. @GaryG?

@Semla, interested in helping with prep and post statistical work?
Happy to help out!
 

Rick Sykora

Major Contributor
Forum Donor
Joined
Jan 14, 2020
Messages
3,517
Likes
7,027
Location
Stow, Ohio USA
It looks like @Inverse_Laplace and I will be conducting a second blind listening test. It'll be on 3/28/22 and/or 3/29/22, so we've got some time to plan.

We'd like to get feedback on improvements or changes for round two. Here are a few items so far:

  • ITU-R BS.1770 loudness instead of C weighting
  • Significantly larger listening room
  • Better positioning of speakers (no table)
  • Attempt to bring in more listeners that skew trained
  • Only similar speakers (likely all powered bookshelf/monitors)
  • Take room measurements of each speaker at prime listening position
  • Possibly a set of scores with frequencies below the transition frequency rolled off (just seems like it might be interesting)
We are not planning to rotate speakers into an identical position. Our plan would be to only include speakers that have spinorama data or could be measured by @amirm at some point.

I own two powered speakers: Kef LSX and Vanatoo T0.

We're open to suggestions for speakers, we will probably limit the pool to 4 total. We don't have to stick with powered--but I find powered speakers more interesting than passive. If the speaker isn't cheap--someone will likely need to help provide a sample to be returned after the test. @Rick Sykora, any interest in including Directiva in this?

@amirm, since you're more or less local, we'd love to have you participate if you're interested. @GaryG?

@Semla, interested in helping with prep and post statistical work?
While I appreciate the effort and offer, I will need to know about the test conditions. For example, depending on location in a large room, r1 may need different bass tuning than it currently has. If the room is much larger than 50 ft3, then the higher output of r2 may be needed, and I'm not sure it will be ready in time. Thanks!
 

Spocko

Major Contributor
Forum Donor
Joined
Sep 27, 2019
Messages
1,621
Likes
2,999
Location
Southern California
A few questions:
  • How far will the listening position be from the plane of the speakers?
  • If you have active speakers with DSP room correction, will this be enabled? If so, should you have both "enabled" and "disabled" listening tests with these room-corrected speakers, just so it's more apples to apples when other speakers do not have room correction?
  • Do you plan to have a wide selection of music/content with the intent to uncover speaker weaknesses or will the music selection be what listeners want to hear?
    • Will reviewers be given enough time to familiarize themselves with the music selection?
I'm loving this project by the way and hoping some new cool active speakers will be available for purchase by the time your test rolls around.
 
OP

MatthewS

Member
Forum Donor
Joined
Jul 31, 2020
Messages
94
Likes
859
Location
Greater Seattle
MatthewS - you are clearly a masochist! That said, I and others on this Forum greatly appreciate the effort you are extending!

I would like to support your effort. I could PayPal a contribution- PM me information so I may help out.

Tillman
Thank you! Let's get a little closer and see how things are shaping up, might have enough speakers on hand.
 
OP

MatthewS

Member
Forum Donor
Joined
Jul 31, 2020
Messages
94
Likes
859
Location
Greater Seattle
While I appreciate the effort and offer, I will need to know about the test conditions. For example, depending on location in a large room, r1 may need different bass tuning than it currently has. If the room is much larger than 50 ft3, then the higher output of r2 may be needed, and I'm not sure it will be ready in time. Thanks!

It's an oddly shaped room (mid century modern). It's about 15.5' x 21'. The ceiling is sloped and is 8' high on the low side and 11' on the high side. There is a floor to ceiling fireplace in the middle.
 
OP

MatthewS

Member
Forum Donor
Joined
Jul 31, 2020
Messages
94
Likes
859
Location
Greater Seattle
  • How far will the listening position be from the plane of the speakers?
  • If you have active speakers with DSP room correction, will this be enabled? If so, should you have both "enabled" and "disabled" listening tests with these room-corrected speakers, just so it's more apples to apples when other speakers do not have room correction?
  • Do you plan to have a wide selection of music/content with the intent to uncover speaker weaknesses or will the music selection be what listeners want to hear?
    • Will reviewers be given enough time to familiarize themselves with the music selection?

  • The room is almost 16' long in the direction we'd probably sit. I'd guess 8' would be reasonable--though we'd be open to any suggestions.
  • We probably wouldn't do any room correction, except possibly doing something to address low frequency room modes. The jury is still out on how we might add extra tests here.
  • We will most likely use selections from Harman's top 25 tracks: https://artoflistening.harman.com/professional-reference-songs
    • We previously gave folks however much time they wanted during the test. I believe the research suggests that listeners need not be very familiar with the material.
 

Rick Sykora

Major Contributor
Forum Donor
Joined
Jan 14, 2020
Messages
3,517
Likes
7,027
Location
Stow, Ohio USA
It's an oddly shaped room (mid century modern). It's about 15.5' x 21'. The ceiling is sloped and is 8' high on the low side and 11' on the high side. There is a floor to ceiling fireplace in the middle.

Probably more of an r2 size room...

Also, when Amir had a Directiva r1, we might have just shipped the matching one. That is no longer an option. The r2 monitor may be passive, so it would likely be a better fit when it ships to Amir. Timing looks about right too!
 

Spocko

Major Contributor
Forum Donor
Joined
Sep 27, 2019
Messages
1,621
Likes
2,999
Location
Southern California
  • The room is almost 16' long in the direction we'd probably sit. I'd guess 8' would be reasonable--though we'd be open to any suggestions.
  • We probably wouldn't do any room correction, except possibly doing something to address low frequency room modes. The jury is still out on how we might add extra tests here.
  • We will most likely use selections from Harman's top 25 tracks: https://artoflistening.harman.com/professional-reference-songs
    • We previously gave folks however much time they wanted during the test. I believe the research suggests that listeners need not be very familiar with the material.
How quickly can you switch song tracks? The latency between tracks and our recall are inversely related: the longer the latency, the more likely our recall becomes inaccurate. More importantly, I believe you should switch speakers within the same track rather than listening to the entire selection of different tracks and then switching over. To give you an idea of what I mean, here are examples of what I recently did to compare the differences (time-stamped to the relevant section):
  • Sony X95J TV speaker vs its use as center channel within the HT-A9 home theater system compared:

  • HT-A9 speakers placed in various positions compared:
It's easier to quickly identify subtle differences when specific tracks are played back to back, while your memory is still fresh, if the switch happens within about one second. The reason I bring this up is that when I originally made my comparison test tracks, I had a hard time recalling the specific differences without playing the sections back to back like this; that was when I realized how inaccurate my own audio memory is. Obviously, this approach may be nearly impossible in a real-world setup, but a few guiding principles may still be relevant:
  • You don't have to play the entire track of a song; focus on maybe 5-9 seconds of a section that best represents why the song was selected. The longer you play the song, the more time elapses to erase the memory of the sections that "grabbed you".
  • Do not play the entire list of tracks before switching speakers, because your memory gets muddled not only by time but also by the variety of musical genres.
  • Reduce the latency between speaker switches with practice: use a stopwatch to see how quickly you can switch, and at some point you'll find an approach worthy of calling in the Guinness World Record keeper for speaker-switching latency.
  • This is step one, helping reviewers identify acoustical properties unique to each speaker. With these in mind, they can proceed to step two: listen to the entire track, focusing on the properties separating the speakers and deciding whether each difference is pleasant or unpleasant.
Obviously, home theater audio reviews are a little easier in the sense that I'm not focused on the frequency curve and listener preference but rather on the "immersive" sound quality and 3D realism in my comparisons. Nevertheless, we are both affected by the limits of our short-term memory when it comes to specific acoustical detail.
 
OP

MatthewS

Member
Forum Donor
Joined
Jul 31, 2020
Messages
94
Likes
859
Location
Greater Seattle
How quickly can you switch song tracks? The latency between tracks and our recall are inversely related: the longer the latency, the more likely our recall becomes inaccurate. More importantly, I believe you should switch speakers within the same track rather than listening to the entire selection of different tracks and then switching over.

We were able to switch tracks instantly. We used a software soundboard with the clips loaded and the output to each speaker randomized. We focused on a single musical selection at a time, and we were able to jump back and forth between speakers on that track as much as the listener wanted. We pulled 30-second selections--and while we didn't have to listen to the entire selection to switch, we decided that in future tests we would use a shorter clip. Listeners would often ask us to bounce back and forth between two speakers for 5 or so seconds at a time.

Here is what the software looked like:

soundboard.png
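The randomized-output idea can be sketched in a few lines. This is an illustrative reconstruction, not the actual soundboard software, and the speaker names are partly placeholders:

```python
import random

def blind_assignment(speakers, seed=None):
    """Randomly map anonymous soundboard buttons (A, B, C, ...) to speakers,
    producing a key the operator keeps sealed until scoring is finished.
    Illustrative sketch only, not the software used in the test."""
    rng = random.Random(seed)
    shuffled = list(speakers)
    rng.shuffle(shuffled)
    labels = [chr(ord("A") + i) for i in range(len(shuffled))]
    return dict(zip(labels, shuffled))

key = blind_assignment(["KEF LSX", "Vanatoo T0", "Speaker 3", "Speaker 4"], seed=7)
print(key)  # the sealed key mapping buttons to speakers
```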
 