Binaural blind comparison test of 4 loudspeakers

preload · Oct 13, 2021

Haruko said:
there is a way to level them ! i will tell u the LUFS for each of the tracks: [-20.6 -15.8] [-19.8 -14.7] [-18.7 -15.5] [-20.0 -15.3] so i would just pick the lowest value [-20.6 -15.8] and level the rest of the tracks....

Can I double-check this is what you're saying about the LUFS volume measurements?

Track 1:
Volume (highest) C > B > D > A (lowest)
Track 2:
Volume (highest) B > D > C > A (lowest)

Because if so, and removing C (which was the anchor and would sound noticeably worse at any volume), for both tracks, the Revel recordings were slightly higher in volume than the Quad and the B&W:

Revel (highest volume) > Quad > B&W (lowest volume) for both tracks

From an experimental design standpoint, I think it's widely accepted on ASR that level differences can bias results in favor of the louder recording. I get that there was a desire not to alter the recordings, but I believe it's also well known that it's pretty easy to make transparent level adjustments with software.

thewas · Oct 13, 2021

preload said:
From an experimental design standpoint, I think it's widely accepted on ASR that level differences can bias results in favor of the louder recording. I get that there was a desire not to alter the recordings, but I believe it's also well known that it's pretty easy to make transparent level adjustments with software.

As I wrote before the problem is rather the different frequency and dynamic responses of different loudspeakers, for example a loudspeaker which goes deeper in bass and/or will compress less will have a higher averaged SPL even if the rest of the range they have the same level, so you can only do some choices on which calculation they should give the same result but this doesn't mean that will also sound really same loud all the time. For example the ReplayGain analysis gives different level differences than the above LUFS one. This means you that there is unfortunately not a single or correct or wrong method how to equalise them when you have audio devices with different frequency and/or dynamic responses.

preload · Oct 13, 2021

thewas said:
As I wrote before the problem is rather the different frequency and dynamic responses of different loudspeakers, for example a loudspeaker which goes deeper in bass and/or will compress less will have a higher averaged SPL even if the rest of the range they have the same level, so you can only do some choices on which calculation they should give the same result but this doesn't mean that will also sound really same loud all the time. For example the ReplayGain analysis gives different level differences than the above LUFS one. This means you that there is unfortunately not a single or correct or wrong method how to equalise them when you have audio devices with different frequency and dynamic responses.

I hear you when you say that differences in FR of each speaker along with the lack of a standard made it challenging to come up with a way to level match the files you borrowed. However, at the same time, surely we all know that differences in level can bias the listener to prefer the louder version. So yes it may not have been practical to attempt level matching, but that doesn't change the fact that bias would have existed in the form of level differences (and in this case, fully explaining the listener selections, when excluding speaker C as the low-performing anchor).

Also, yes, there are ways to level match when auditing loudspeakers. The method used by Olive in JAES Convention Paper 6113 (2014) was:
"Each loudspeaker was level-matched to within 0.25 dB (B-weighted) using pink noise fed to each loudspeaker. The calibrated microphone (AKG-CK62) was positioned at ear-height over the middle front row chair. Levels were calculated using SpectraLAB (version 4.32)."
This was the method used in Olive's well known paper where he developed the regression formula to predict blind listening preferences from measurements, where eliminating bias from level mismatch would have critical to what he was trying to do.

thewas · Oct 13, 2021

preload said:
Also, yes, there are ways to level match when auditing loudspeakers. The method used by Olive in JAES Convention Paper 6113 (2014) was

And once again, this one of dozens methods but none guarantee that the loudspeakers will actually and consistently sound equally loud to everyone.

To make that more clear for you, here is another simple example:

Assume loudspeaker A with a straight line frequency response rising from 20 Hz to 20 kHz for 5 dB and loudspeaker B with a straight line frequency response falling from 20 Hz to 20 kHz for 5 dB.

Now you level match them to any possible method, resulting that their responses cross somewhere in the 20 Hz to 20 kHz range.

When playing tracks or parts of tracks where the tonal spectrum dominates in the region above the crossing frequency loudspeaker A will sound louder while the opposite for the other region and there is nothing you do about it when using variable spectrum signals like music.

By the way also the team who did those recordings back then did level match them in some way as otherwise for example the Klipsch would have been approximately 15 dB louder than the Quad:

preload · Oct 13, 2021

thewas said:
And once again, this one of dozens methods but none guarantee that the loudspeakers will actually and consistently sound equally loud to everyone.

To make that more clear for you, here is another simple example:

Assume loudspeaker A with a straight line frequency response rising from 20 Hz to 20 kHz for 5 dB and loudspeaker B with a straight line frequency response falling from 20 Hz to 20 kHz for 5 dB.

Now you level match them to any possible method, resulting that their responses cross somewhere in the 20 Hz to 20 kHz range.

When playing tracks or parts of tracks where the tonal spectrum dominates in the region above the crossing frequency loudspeaker A will sound louder while the opposite for the other region and there is nothing you do about it when using variable spectrum signals like music.

By the way also the team who did those recordings back then did level match them in some way as otherwise for example the Klipsch would have been approximately 15 dB louder than the Quad:

Right I obviously understand why level matching speakers is not easy or straightforward when the FR curves differ and you’re playing back music that may involve different parts of the frequency spectrum. I’m not disputing that.

What I am saying is that for the purposes of correcting for experimental bias resulting from level mismatch, it IS possible and necessary to make level corrections. While there are likely dozens of methods, as you noted, I am pointing out that the aforementioned level correction method by Olive was validated specifically for blind loud speaker listening tests, it was peer reviewed, and it enabled him to create the now-famous preference score (which would not have been possible if level mismatch had skewed his results). I’m other words, yes there’s an acceptable way to level match to sufficiently remove listener bias, it circumvents your stated concerns, and it’s been done successfully for the purposes of audio research of loudspeaker preference.

However what is more concerning to me is that you appear to be speculating what the conditions of.the recording were (ie they “probably” did something with the levels based on your examination of their FR charts). Therein lies another problem. If I didn’t know any better, it would appear that you know very little about these loudspeaker recordings you acquired: what level corrections they used, what the room size/shape/treatment/RT60 was, where the speakers were positioned, how the speakers were toed in, mic height and distance, etc. All basic questions I would expect an open minded individual might want to know.

If we don’t have sufficient detail about how the recordings were made and corrected for bias, how are readers who weren’t involved in this experiment supposed to know whether the results are real or if they’re actually an artifact of a problematic experimental design?

BostonJack · Oct 13, 2021

I'm pleased that my first choice (the Revel) was in line with consensus and feel ok about my second choice (the Quad). Interesting in that my non-consensus choice was made because there was something about more detail being presented, which isn't unreasonable.

Binaural recordings were *awful*, truly awful. That, in a way, is the most interesting result: how did we get to a more-or-less rational, more-or-less "correct" consensus when the recordings themselves seemed to be totally butchering the music?

The second interesting thing is that the best of those Youtube speaker reviews seem to show pretty valid speaker differences. Are those perhaps an actual basis for comparing speakers? Or at least a worthwhile input?

thewas · Oct 13, 2021

preload said:
What I am saying is that for the purposes of correcting for experimental bias resulting from level mismatch, it IS possible and necessary to make level corrections. While there are likely dozens of methods, as you noted, I am pointing out that the aforementioned level correction method by Olive was validated specifically for blind loud speaker listening tests, it was peer reviewed, and it enabled him to create the now-famous preference score (which would not have been possible if level mismatch had skewed his results). I’m other words, yes there’s an acceptable way to level match to sufficiently remove listener bias, it circumvents your stated concerns, and it’s been done successfully for the purposes of audio research of loudspeaker preference.

Again, just because Olive used it and described it in an AES paper it doesn't mean it will eliminate the problem of different FR and dynamics and also that is THE method do it, or do you see a validation of it in the paper? Different loudspeaker will always sound different loud with different music material even at different parts of a track, this will always happen and there won't be one and correct method to do something to avoid it. This can of course be used as a red herring making any audio device comparison "impossible" but that is not true. The real question namely is would and if yes how the result be different with different level adjustment methods and how much this corresponds to the listeners reality where the listened will adjust a level that sounds comfortable to him and listen and compare.

preload said:
If I didn’t know any better, it would appear that you know very little about these loudspeaker recordings you acquired: what level corrections they used, what the room size/shape/treatment/RT60 was, where the speakers were positioned, how the speakers were toed in, mic height and distance, etc. All basic questions I would expect an open minded individual might want to know.

I know what is written in the corresponding issue(s), not more or less, we shouldn't forget it was a consumer mag test where such details are unfortunately not published and not an AES paper. By the way even in the famous Harman tests there will always discussions about for example the optimal placement and toe-in (even almost all loudspeaker manuals give not clear recommendations there but just the general >0,5 m distance to walls stuff bla bla) and to which living room style the RT60 should correspond. In each test we can only listen and judge to what happened in that exact setup but despite the very different approaches the results seem to correlate and some colorations could identified through the recordings. These are the data points we have for now, as I repeatedly wrote hope more will be performed in the future, even better from ASR members with today's knowledge and state of the art tech.

preload · Oct 13, 2021

BostonJack said:
I'm pleased that my first choice (the Revel) was in line with consensus and feel ok about my second choice (the Quad). Interesting in that my non-consensus choice was made because there was something about more detail being presented, which isn't unreasonable.

You may be interested in knowing that the Revel recording was the loudest, and the Quad was the second loudest.

Binaural recordings were *awful*, truly awful. That, in a way, is the most interesting result: how did we get to a more-or-less rational, more-or-less "correct" consensus when the recordings themselves seemed to be totally butchering the music?

That is exactly the question I’m trying to answer. Because if something is too good to be true, it probably is. Or at least we should ask questions.

Tangband · Oct 20, 2021

Both A and D has less good directivity, even if they have good reputations in the hifi-area.

CrazyDwarf · Jan 20, 2022

Where is the zip file with the recordings? I cannot seem to find them to try and compare myself. Kind regards.

garyrc · Nov 17, 2022

LTig said:
Just wait until the owners of B&W and Klipschhorn stumble over this thread ...

I'm revving up right now!

garyrc · Nov 18, 2022

preload said:
If we don’t have sufficient detail about how the recordings were made and corrected for bias, how are readers who weren’t involved in this experiment supposed to know whether the results are real or if they’re actually an artifact of a problematic experimental design?

I agree with the above, and most of what @preload said.

Was there control for Order Effect?

Was there control for Carry-over Effect?

How recently had the participants heard LIVE music of similar style?

Klipschorns of that vintage needed to be SEALED into a corner with closed cell neoprene, or compressed pipe insulation or the Klipsch sealing kit. By the way, it is estimated that there was one appreciable change in the Klipschorn every 10 years, on average. The ones I know about are 1 in 1957, 2 in 1963, 1 in 1980, 1 in 1983, 1 in 1987, 3 in 2002, and 2 in 2020.

One of the few things Edgar Villcher and Paul Klipsch agreed about was that, in many, if not most, rooms, if someone moved their head [or a mic] a few inches, the frequency response would change. In modern times, most people somehow combined the response from several mic positions [tight around the head of a single listener -- the proverbial head clamp -- or with some kind of compromise(d) multiple mic positions spread around a wider group of listeners]. The combining process ranged from an average to a "fuzzy logic" solution. Here is one of my Klipschorns (AK4), with Bass Tone Control at + 6 dB, and Audyssey Reference (causing a Harman-like downward tilt) in my 4,000 ++ cubic foot room, at 13 feet away, as "heard" by a calibrated mic in 8 positions, in a field about 3.5 "seats" wide on a couch, at ear height. The sweep starts at 43Hz, because that is where the crossover to a subwoofer begins [actually 40Hz], and I wanted this to be K-horn only.

average of 8 mic positions Aud Ref, B+6, Box on, Sub Trim + 3 on Marantz.jpg

In response to the person who said, "I thought people bought them for bass," while a little EQ is usually used, bass impact (Fanfare for the Common Man, The Great Gate of Kiev, the end of some Mahler or Beethoven symphonies) without a subwoofer, will flap one's pants legs, and seem to move the couch, and bend my desk out of square.

FWIW, I have played in 5 orchestras, listened to many more, and, although I haven't heard the Revel, nothing I have heard has the convincing brass, piano, woodwinds and percussion of the Klipschorn. Some others do strings better. I'm confident that, for a much higher price, there must be several others.

I am aware some are saying speaker distortion matters not, but here are some notes, FWIW:

thewas · Nov 18, 2022

garyrc said:
I am aware some are saying speaker distortion matters not

I don't think anyone says that, only that their audibility is not easy to determine

Floyd Toole said:
Non-linear distortion measurements are routine in the design of transducers and systems, but they are mainly useful in a relative not absolute sense, meaning that if a measurement gets smaller, that is good, but whatever the measurement is does not reflect how it might sound in music, in a room. It is very rare for distortion to be a factor in listening test results, but it has happened, most recently IMD in a concentric mid/tweet driver. Here is something I wrote on the topic a while back.

and most important

"Frequency response is the single most important aspect of the performance of any audio device. If it is wrong, nothing else matters." – Floyd Toole, 2009

gino1961 · Nov 18, 2022

garyrc said:
.... I am aware some are saying speaker distortion matters not, but here are some notes, FWIW:
View attachment 244104

Hi ! thank you very much for disclosing this test results. I find very interesting to see that the higher the distortion is and the more annoying is the effect on perceived sound
People often hear with the eyes ... look can be deceiving Blind testing is the way to go Unfortunately ugly but good sounding speakers will never get the consensus of the standard wife ... she looks usually at a speaker like another piece of furniture

preload · Nov 18, 2022

garyrc said:
I agree with the above, and most of what @preload said.

Was there control for Order Effect?
Was there control for Carry-over Effect?
How recently had the participants heard LIVE music of similar style?

Good point. I doubt they randomized the listening order, which would have eliminated these types of biases. However, I honestly think that's the least of our worries, since the so-called experiment had so many problems, I don't even know where to begin. Folks that don't deal with experimental design all day and just see the word "blind," tend to suddenly think that a "study" is unbiased and trustworthy. That is simply not true. There are a ton of other biases (other than that caused by "sight," which blinding controls for) that were not accounted for in this magazine report.

For starters, LEVELS weren't matched. NOT SURPRISINGLY, the "blinded" listeners preferred the speakers that had higher levels (which we confirmed).

Secondly, many speakers need to be placed in optimal positions in the room with optimal toe-in. From what I can tell, the speakers were simply wheeled into the room with no attention to fine-tuning speaker placement. Everybody knows that you can take high performance speakers and make them sound terrible by placing them in the wrong part of the room - and yes, this affects the frequency response by audible amounts. What's worse is that for speakers that are engineered intentionally with changes in directivity, the toe-in matters a LOT because the off-axis response differs by angle.

So, individuals who read this magazine report, don't have more than a superficial background in experimental design, and see that the Harman speaker was scored "#1" will stop there and conclude that the magazine experiment proves that the Revel speaker has the most preferred sound quality. And these individuals would actually be exhibiting something called "confirmation bias," which is the tendency for people to favor information in a way that confirms their original beliefs. And I'm seeing a lot of confirmation bias in this thread.

Binaural blind comparison test of 4 loudspeakers

Which loudspeaker sound do you personally prefer?

Loudspeaker A

Loudspeaker B

Loudspeaker C

Loudspeaker D

preload

Major Contributor

thewas

Master Contributor

preload

Major Contributor

thewas

Master Contributor

preload

Major Contributor

BostonJack

Active Member

thewas

Master Contributor

preload

Major Contributor

Tangband

Major Contributor

CrazyDwarf

Member

garyrc

Active Member

garyrc

Active Member

thewas

Master Contributor

gino1961

Addicted to Fun and Learning

preload

Major Contributor

Similar threads