It's interesting how statistics has different flavors depending on the field of application. I was not familiar with those quantities, although that doesn't mean much, as I don't have that much experience.
....
....
The approach of @Semla is more ambitious as it builds a model to predict future scores.
The discussion with @Semla is more about the assumptions that can or cannot be made, rather than about the nature or reliability of the tests themselves. He/she proposes a linear model with dummy variables because it accounts for all the relations between the different factors, which is indeed very reasonable.
Please, correct me if I'm wrong.
Edit: I forgot to answer whether the coefficients you point to can be applied here.
The answer is no, because they are based on real observations after the experiment: measuring all four possible outcomes and calculating certain ratios. That doesn't make sense in this context, as there are no such outcomes when there is no treatment effect. It's a different setting.
Thank you so much for your kind response. I also feel and fully agree with you that "It's interesting how statistics has different flavors depending on the field of appliance." I am just curious about how we can objectively assure or measure the reliability of "the test" we are discussing...
In medical diagnostic R&D, if we could get the results shown in my example diagram above, "the test" should be reasonably reliable because of the sensitivity = 0.917 and the specificity = 0.750 given by the total of 200 cases analyzed. Usually, we analyze more than 1,000 cases, sometimes more than 10,000 cases, for sensitivity and specificity discussions.
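For readers less familiar with these quantities, the arithmetic behind them is simple. A minimal sketch, assuming for illustration a 2x2 table with TP=110, FN=10, TN=60, FP=20 (one split of 200 cases that happens to reproduce sensitivity ≈ 0.917 and specificity = 0.750; the actual counts in the diagram may differ):

```python
# Sensitivity and specificity from a 2x2 confusion matrix.
# Counts are illustrative assumptions, chosen to match the figures above.
tp, fn = 110, 10   # diseased cases: correctly detected / missed
tn, fp = 60, 20    # healthy cases: correctly cleared / false alarms

sensitivity = tp / (tp + fn)   # P(test positive | disease present)
specificity = tn / (tn + fp)   # P(test negative | disease absent)

print(f"sensitivity = {sensitivity:.3f}")  # 0.917
print(f"specificity = {specificity:.3f}")  # 0.750
```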
For the diagnosis of very rare diseases (or disorders, abnormalities), however, we inevitably encounter a limitation on the number of sample/test cases, e.g. fewer than 200 cases. Even in that kind of limited situation, if sensitivity/specificity exceed 0.75 (or 0.80), we may assume "the test" is relatively reliable. "The new test", which has significantly better cost effectiveness than the very expensive "gold standard" procedures, would consequently be accepted (approved) for routine clinical use as a screening test ahead of the "gold standard" test. In this situation, we should take care to forward as many potentially "false negative" cases (i.e. highly disease-suspicious cases, e.g. for cancer) as possible into the expensive "gold standard" test procedure(s). In other words, in disease diagnosis, false positive cases are rather acceptable (they simply go on to further expensive precision diagnosis), but false negative cases (like dismissing a patient who actually has the disease) must be avoided as far as possible.
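One simple, objective way to express how much a small sample limits confidence in an estimated sensitivity (or specificity) is a binomial confidence interval around the observed proportion. A sketch using the Wilson score interval, with illustrative counts (the same point estimate from 120 cases versus only 12 cases):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Same point estimate (sensitivity ~ 0.917), very different certainty:
print(wilson_interval(110, 120))  # ~ (0.853, 0.954) with 120 diseased cases
print(wilson_interval(11, 12))    # ~ (0.646, 0.985) with only 12 cases
```

With only a dozen positive cases, the interval is so wide that a "sensitivity of 0.917" is compatible with a true sensitivity well below 0.75, which is exactly the small-sample concern raised here.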
I know that comparative audio listening tests also always face a limitation on sample/case numbers for statistical analyses. I would like to have/know a simple, easy-to-understand, objective reliability measure ("Merkmal") for such comparative audio listening tests with rather limited sample numbers. I assume your and @Semla's discussions in this thread would almost sufficiently answer my concerns.