
The frailty of Sighted Listening Tests

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
28,184
Likes
72,688
Location
Seattle Area
#23
That we use trained listeners in sighted evaluations because they are quick and in general they are far more correct than wrong?

Do we really know the statement above is true?
I do, because that is the outcome we had at Microsoft, in the team that I managed and in the work I personally did myself. We hired trained listeners for this very purpose. They would have no value to us if they were as wrong as the average Joe in sighted listening, which was the bulk of our evaluation/testing.
 

amirm

#24
In my humble opinion, this being ASR, neither the subjective opinion nor the score should be published. Only the measurements should be.
Klippel = science.
Amir = a human being.
But I also understand that: Amir = his rules.
Klippel is not science. It is a set of measurements. Those measurements are very difficult to interpret as "buy / don't buy" against countless other speakers with similar-looking measurements. A 1 dB peak at 600 Hz is not the same as a 1 dB peak at 1.5 kHz, yet the score may be identical. We need to bridge that gap so people can purchase speakers without listening to them, which is the norm today.
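To make that concrete, here is a toy sketch (my own illustration, not the actual ASR/Olive preference-rating formula) of how a deviation-only score is blind to where a peak sits in frequency:

```python
import numpy as np

# Toy illustration: a score based only on deviation-from-flat magnitude
# cannot distinguish WHERE a peak sits in frequency.
freqs = np.logspace(np.log10(20), np.log10(20000), 1000)  # 20 Hz - 20 kHz

def response_with_peak(center_hz, height_db=1.0, width_octaves=0.5):
    """Flat response plus a Gaussian (in log-frequency) peak."""
    octaves = np.log2(freqs / center_hz)
    return height_db * np.exp(-0.5 * (octaves / width_octaves) ** 2)

def naive_score(deviation_db):
    """RMS deviation from flat -- ignores peak frequency entirely."""
    return float(np.sqrt(np.mean(deviation_db ** 2)))

peak_600 = response_with_peak(600.0)    # 1 dB peak at 600 Hz
peak_1500 = response_with_peak(1500.0)  # 1 dB peak at 1.5 kHz

# The two speakers measure differently but score (nearly) identically.
print(round(naive_score(peak_600), 4), round(naive_score(peak_1500), 4))
```

Both responses get essentially the same number, even though a 600 Hz coloration and a 1.5 kHz coloration sound quite different.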

If we had a scoring system we could all stand behind so completely that if it said speaker A is better than speaker B, that would be the "truth," then sure, I would not need to do listening tests. But we are not there. A scoring system is like a compass that shows you north. It is not a turn-by-turn navigation system for driving in the city.

Also, when I first started to do measurements, people kept asking me what I recommend. I refused to say. We had a bunch of debate threads about it. Eventually I got tired of answering those questions in private and in public, and added the recommendations. That has proven to be hugely popular and rarely controversial. Today I cannot give such recommendations without listening to a speaker. So as much work and aggravation as it has turned out to be, I listen and provide this as a factor in my recommendation.

And no, not all "human beings" are the same. Which one of you has been exposed to nearly 80 speakers in the last 7 months, where you could compare and correlate measurements to what you could hear? The answer is none. In other words, I am not situated like any of you. There are many things that apply to you that don't apply to me, and vice versa. We rely on the informed opinion of experts in real life all the time. Not sure why it is such a big deal to follow the same in audio.
 

amirm

#25
Only the blind test above will provide real proof.
The rest is anecdotal or at least suspect/questionable.
Oh, blind tests can be just as questionable. Blinding removes some bias; it doesn't automatically make a test a truth teller. Real example:

We developed the WMA audio codec to be twice as good as MP3. The marketing department came and said they needed an independent study to prove that. Knowing that we could not do that across all people and all content, I was worried but could not push back with any good reason. So we hired an outside agency, to the tune of $25,000, to bring in some 100 people to take a double-blind listening test. The testing company adopted ITU-R BS.1116 as the protocol, this being the gold standard for finding small impairments, especially in lossy audio codecs. I was worried in a pre-call with the company until I heard what music they had selected: "audiophile classical music." Classical music is harmonic and as such compresses a lot more easily than, say, rock music. The testing company proudly declared that since classical music was what audiophiles used to test gear, it surely made for the best test material. WRONG!

They proceeded to hire the 100 people, and what was the outcome? More than 90% of the listeners thought that our codec at 64 kbps sounded the same as MP3 at double the rate, 128 kbps. Marketing was happy, and a press release declared the same.

Of course, I had countless audio clips where the above was not true, where I could easily tell that we had issues with transient response and such at the low bit rate of 64 kbps (21-to-1 compression!). But here it was: a full, standard-compliant, double-blind test saying otherwise!

Where the testing company went wrong was that they did not know how lossy compression worked and, as such, what kind of content would expose its issues. Nor did they understand the role of trained listeners in being able to hear artifacts that the general public could not. They hired people from a local mall and gave them a few dollars to take the test, which assured few if any critical listeners.

I listened to their test clips and, despite them being easy to compress, I could tell we were not as good.
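As an aside, the math behind judging any individual blind forced-choice result is simple binomial statistics. A minimal sketch (my own illustration, not part of the BS.1116 study; the trial counts are hypothetical):

```python
from math import comb

def abx_p_value(correct, trials):
    """One-sided binomial p-value: probability of getting at least
    `correct` right answers out of `trials` if the listener were
    purely guessing (50/50 on each ABX trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# A trained listener who hears the transient artifacts: 14/16 correct.
print(abx_p_value(14, 16))  # ~0.002 -- very unlikely to be guessing

# A mall-recruited listener near chance: 9/16 correct.
print(abx_p_value(9, 16))   # ~0.40 -- indistinguishable from guessing
```

The point: a panel full of near-chance listeners produces a clean "no difference" result that is statistically valid yet says nothing about what a critical listener can hear.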


Blind tests are only good for controlling bias. That alone does not in any way assure that the truth is out, or that any real proof exists. As you know, Harman has used blind tests to say we like exaggerated bass. I am sure I have read that you don't agree with that, and I don't either. Yet we have the research. We have the blind test. Clearly, then, no "proof" has been provided that at least picky audiophiles like exaggerated bass in headphones.

Conclusion
There is a key truth here: you must look at the specifics to know if a test -- blind or sighted -- is correct. The fact that a test is sighted doesn't automatically make the outcome wrong ("we are all human"). Nor does the fact that a test is blind and published automatically support its general conclusions.

Sean Olive's papers always end with a list of issues not addressed in the testing, for the above reason. Alas, people don't read those papers, or if they do, they ignore the qualifications.

I'll take a trained listener's sighted test over untrained and improper blind testing every day of the week and twice on Sunday!
 

KeithPhantom

Active Member
Forum Donor
Joined
May 8, 2020
Messages
277
Likes
187
#26
There is a key truth here: you must look at the specifics to know if a test -- blind or sighted -- is correct. The fact that a test is sighted doesn't automatically make the outcome wrong ("we are all human"). Nor does the fact that a test is blind and published automatically support its general conclusions.
This is what I would assert about this situation. As a student of finance and econometrics, I've learned that one, two, or sometimes even three coefficients can't establish causation. Finding correlation is easy: a simple test statistic, p-value, or r-squared can suggest that there may be some relation between the variables. But apply an F-test, or another test suited to the conditions you actually have, and the apparent relationship can fall apart, because a different statistic with a different distribution doesn't follow the trend. No single piece of evidence will ever explain causation or serve as absolute proof.
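A classic illustration of this point: two completely independent random walks routinely show a sizeable correlation coefficient. A quick sketch (my own, purely illustrative):

```python
import random

# Two INDEPENDENT random walks often show strong "correlation" --
# a textbook example of why a single correlation statistic can't
# establish causation.
random.seed(1)

def random_walk(n):
    """Cumulative sum of n standard-normal steps."""
    x, walk = 0.0, []
    for _ in range(n):
        x += random.gauss(0, 1)
        walk.append(x)
    return walk

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

a, b = random_walk(500), random_walk(500)
print(round(pearson_r(a, b), 3))  # often far from 0 despite independence
```

Rerun this with different seeds and the coefficient swings wildly, which is exactly why one correlation number proves nothing about causation.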
 

amirm

#27
Here is a specific example of the fine print you must read, per the above, regarding the testing of headphones in this AES paper:
Factors that Influence Listeners’ Preferred Bass and Treble Balance in Headphones
Sean E. Olive and Todd Welti

[screenshot of an excerpt from the paper]


How much do we emphasize level matching in blind tests? A ton, right? Yet this study made no attempt at keeping levels the same.
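For reference, level matching itself is mechanically simple, which makes skipping it all the more notable. A minimal sketch of RMS-based matching (my own illustration, not the paper's procedure):

```python
import math

def rms_db(samples):
    """RMS level of a clip in dB (10*log10 of mean-square power)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square)

def match_level(reference, clip):
    """Scale `clip` so its RMS level equals the reference's."""
    gain_db = rms_db(reference) - rms_db(clip)
    gain = 10 ** (gain_db / 20)  # power-dB difference -> amplitude gain
    return [s * gain for s in clip]

# One second of a 440 Hz tone at 48 kHz, at two different amplitudes.
ref  = [0.5 * math.sin(2 * math.pi * 440 * t / 48000) for t in range(48000)]
loud = [0.9 * math.sin(2 * math.pi * 440 * t / 48000) for t in range(48000)]

matched = match_level(ref, loud)
print(round(abs(rms_db(matched) - rms_db(ref)), 6))  # → 0.0 (levels match)
```

Level differences well under 1 dB are reliably heard as "better," so unmatched levels can masquerade as a preference for one headphone's tuning.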

[screenshot of an excerpt from the paper]


Three music tracks? Out of 20 million albums, with no standardization of their tonality during creation? Surely some music that is already bass-heavy would go over the top when you boost that region again.

[screenshot of an excerpt from the paper]


You don't think these are major limitations? Of course they are.

Now, the research is still highly valuable. But among ourselves -- the believers in research -- we need to be honest about how strong the evidence is. Heaven knows we had better not ignore the limitations and stick our heads in the sand "because the test is blind."

I would happily take a non-blind test without the above limitations, run with trained listeners, over the above tests with their limitations.

Don't generalize, folks. That is my major beef with the few of you raising objections. You are not paying attention to the details that matter.
 

patate91

Active Member
Joined
Apr 14, 2019
Messages
253
Likes
131
Thread Starter #28
@amirm I think at this point you should write to Sean Olive and explain where he was faulty with his experiments about biases. We could all benefit from this exchange.
 

amirm

#29
@amirm I think at this point you should write to Sean Olive and explain where he was faulty with his experiments about biases. We could all benefit from this exchange.
He knows what is faulty. He lists it in his papers! It is folks here who need correcting, because a) they don't read the papers and b) they run with talking points from very detailed research. Don't do that. Listen to someone who has read the papers and can both defend and critique them.

And please don't try to be clever with debating tactics in your comment above: "You go talk to the researcher if you think you know more." I know more than you. That is the key point. Make yourself more knowledgeable so that you can defend yourself instead of using such tactics.

This is an opportunity to learn this topic from someone who practices it. Don't blow it with comments like that.
 

patate91

Thread Starter #31
He knows what is faulty. He lists it in his papers! It is folks here who need correcting, because a) they don't read the papers and b) they run with talking points from very detailed research. Don't do that. Listen to someone who has read the papers and can both defend and critique them.

And please don't try to be clever with debating tactics in your comment above: "You go talk to the researcher if you think you know more." I know more than you. That is the key point. Make yourself more knowledgeable so that you can defend yourself instead of using such tactics.

This is an opportunity to learn this topic from someone who practices it. Don't blow it with comments like that.
In this thread I'm not debating you; I asked you to debate with Olive. You know him, you certainly know how to contact him, and you know more than me.

Again, your exchange would benefit all of us.
 

patate91

Thread Starter #32
The conclusion of the above study is pretty clear.

This conclusion and experiment are not mine. That's why I think people need to challenge the experiment's contents and the authors. That would be very productive and beneficial to the community.
 

GD Fan

Active Member
Forum Donor
Joined
Jan 7, 2020
Messages
216
Likes
216
#33
Amazing how some here are really seeking a hill to die on. For my part, I'm grateful for all the work Amir does and count myself lucky to have stumbled upon this place to learn from someone vastly more knowledgeable and experienced than I will ever be in these matters. Are his listening tests and opinions the final word on the matter? Maybe not, but they're an opinion I'd take over probably any other listening impression I could have access to.
 

Robbo99999

Addicted to Fun and Learning
Forum Donor
Joined
Jan 23, 2020
Messages
866
Likes
508
Location
UK
#34
In my humble opinion, this being ASR, neither the subjective opinion nor the score should be published. Only the measurements should be.
Klippel = science.
Amir = a human being.
But I also understand that: Amir = his rules.
That would be more boring, though, and possibly also less informative.
 

Blumlein 88

Major Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
9,632
Likes
12,864
#35
I do, because that is the outcome we had at Microsoft, in the team that I managed and in the work I personally did myself. We hired trained listeners for this very purpose. They would have no value to us if they were as wrong as the average Joe in sighted listening, which was the bulk of our evaluation/testing.
I'm not criticizing here, just trying to learn what is known, how it is done, and what limits to expect of trained listeners doing sighted comparisons.

I think of what an optometrist does when you get an eye exam. Look at the letters on the wall; he makes a change: "is that better or worse?" A sighted test. Even an untrained person can go from blurry vision to quite good vision this way, well enough to function in the everyday world.

I've fine-tuned target curves in room correction for a number of people. I listen with them, they describe what they wish were different, or maybe I suggest something I could make better. I tweak the target curve and do much like the optometrist: I switch from the old curve to the new one, confirming whether it is better to them or not (as well as, at some point, seeing they aren't perceiving a difference as the changes get small enough). So in half an hour or so we can make real, beneficial tweaks to the sound, ones you can blind test afterwards to confirm they are audible and, if you wish, which is preferred. Had you made each tweak and then blind tested it before making the next, you'd probably take 100 hours to accomplish the same real result. So blind testing is not always needed for every single thing.

But is there work I'm not aware of that checks trained sighted perceptions against unsighted ones? I think of the JND metric: around a dB or so in loudness for most people if you just ask, like the optometrist. Yet blind, across a series of samples, people can detect differences of 0.2 dB in loudness.
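For what it's worth, sensitivity in such same/different tasks is often summarized with the d' index from signal detection theory. A small sketch using only the Python standard library (the hit and false-alarm rates are hypothetical, chosen just to contrast a trained and a casual listener):

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Signal-detection sensitivity index: z(hits) - z(false alarms).
    Higher d' means the listener separates 'different' from 'same'
    trials better, independent of response bias."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Hypothetical numbers: a trained listener vs a casual one on the
# same same/different loudness task.
trained = d_prime(hit_rate=0.90, false_alarm_rate=0.10)
casual = d_prime(hit_rate=0.60, false_alarm_rate=0.40)

print(round(trained, 2))  # → 2.56
print(round(casual, 2))   # → 0.51
```

Using d' rather than raw percent correct matters here because it separates genuine sensitivity from a listener's bias toward answering "different."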

A reference is important even sighted. That is where many audiophile sighted comparisons fall apart: the lack of ready switching to a reference. A reviewer gets speakers in, sets them up, and listens, making his pronouncements a couple of weeks later. They should do something more like what you do, which is put one speaker in channel A with a known reference in channel B.
 

amirm

#36
Again, your exchange would benefit all of us.
I don't know about that, given how little you want to listen to what I am explaining about the research.

Regardless, I spoke to Sean last year about whether I should get into speaker or headphone testing. His answer: "I am only focused on headphone testing." Harman is a different company years after the research started, and it is now owned by Samsung. Business priorities have surely changed.
 

patate91

Thread Starter #37
I don't know about that, given how little you want to listen to what I am explaining about the research.

Regardless, I spoke to Sean last year about whether I should get into speaker or headphone testing. His answer: "I am only focused on headphone testing." Harman is a different company years after the research started, and it is now owned by Samsung. Business priorities have surely changed.

I asked him; I'll see what his answer is.

As for the research, the conclusion is pretty clear. Yes, you question parts of it, but no word on the conclusion or the whole study.

Again, an exchange between two experts, since it seems you have different positions (and I don't think they are extreme), would benefit people like me who are not experts. (IMO, I would give more weight to scientific papers over personal opinion.)
 

amirm

#40
Again, an exchange between two experts, since it seems you have different positions (and I don't think they are extreme), would benefit people like me who are not experts. (IMO, I would give more weight to scientific papers over personal opinion.)
There is not an ounce of daylight between Sean and me on this topic. We both value blind and controlled testing to the limit.

What you don't understand, and what is not examined in the research, is the usefulness of trained listeners (not just "experienced listeners") in industry for getting practical information on a very fast turnaround. And nowhere is fast turnaround more important than here, where I am trying to test thousands of speakers before I get too old and cranky to do more.

A few of you are heavily interfering with this goal by creating FUD around the testing and, at any rate, consuming an inordinate amount of my time in responses. Despite what you claim, with yet another debating tactic, your fingers are deep in your ears, unwilling to listen to any explanation of the science or engineering.
 