
Listening Test Duration

amirm · Founder/Admin · Staff Member · CFO (Chief Fun Officer)
Joined: Feb 13, 2016 · Messages: 44,376 · Likes: 234,559 · Location: Seattle Area
One thing to consider is that there are different outcomes in ABX. In my tests of that type, I strive for almost complete detection. At most I allow a miss due to inattention. Otherwise, I can take the test 100 times and get 100 right answers. Signal theory, or other experimental doubts, doesn't enter into such results. I "classify" A and B first before doing anything with X. Once I am solid on the difference, I can then identify whether X is A or not. That is why I said the outcome here is invariant whether you do AX or ABX, and that performing ABX adds distraction and nothing more.

Now if I had a population of testers with some statistical outcome like p = 0.6, then we would worry about secondary effects. The whole notion of the 0.5 threshold itself has no science backing it. It was an arbitrary confidence level.
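To put numbers on what such outcomes mean statistically, here is a minimal sketch (an illustration added here, not part of the post; the trial counts are assumptions) of the exact binomial calculation usually applied to an ABX score: the probability of getting at least that many trials right by pure guessing.

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided exact binomial p-value: the chance of getting at least
    `correct` answers right out of `trials` ABX trials by guessing (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Illustrative runs (trial counts are assumptions, not from the thread):
print(abx_p_value(16, 16))  # perfect 16/16 run: ~1.5e-05
print(abx_p_value(12, 16))  # marginal 12/16 run: ~0.038
```

A listener who, as described above, reliably classifies A and B before touching X sits at the first kind of result; the statistical machinery only becomes interesting for the marginal scores.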
 

Blumlein 88 · Grand Contributor · Forum Donor
Joined: Feb 23, 2016 · Messages: 20,525 · Likes: 37,058
One thing to consider is that there are different outcomes in ABX. In my tests of that type, I strive for almost complete detection. At most I allow a miss due to inattention. Otherwise, I can take the test 100 times and get 100 right answers. Signal theory, or other experimental doubts, doesn't enter into such results. I "classify" A and B first before doing anything with X. Once I am solid on the difference, I can then identify whether X is A or not. That is why I said the outcome here is invariant whether you do AX or ABX, and that performing ABX adds distraction and nothing more.

Now if I had a population of testers with some statistical outcome like p = 0.6, then we would worry about secondary effects. The whole notion of the 0.5 threshold itself has no science backing it. It was an arbitrary confidence level.

Did you mean .05 threshold?

Yes, Mr. Fisher just picked it arbitrarily.

Physical sciences and QC in manufacturing quickly found things work better if you only pay attention to p values of .0062 or .0027 (2.5 or 3 sigma results). In fact, using 5% for manufacturing QC typically worsened quality compared with doing nothing. 19 out of 20 would be roughly a 4 sigma result. While that might not track sensory detection theories, it nearly wipes out false positives. False positives would cause more trouble in audio than false negatives.
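For readers who want to check these figures, here is a small sketch (my own illustration, not part of the post) converting sigma levels to tail probabilities and computing the exact binomial tail for a 19-of-20 result. Note that .0062 matches a one-sided 2.5-sigma tail, while .0027 matches a two-sided 3-sigma tail.

```python
from math import comb
from statistics import NormalDist

nd = NormalDist()

def one_sided_p(sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond `sigma`."""
    return 1.0 - nd.cdf(sigma)

def two_sided_p(sigma: float) -> float:
    """Probability of exceeding `sigma` in either direction."""
    return 2.0 * (1.0 - nd.cdf(sigma))

print(one_sided_p(2.5))  # ~0.0062
print(two_sided_p(3.0))  # ~0.0027

# Exact binomial tail for 19-of-20 correct under pure guessing:
p_19_of_20 = sum(comb(20, k) for k in (19, 20)) / 2 ** 20
print(p_19_of_20)                  # ~2.0e-05
print(nd.inv_cdf(1 - p_19_of_20))  # ~4.1, i.e. roughly a 4-sigma (one-sided) result
```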
 

Cosmik · Major Contributor
Joined: Apr 24, 2016 · Messages: 3,075 · Likes: 2,180 · Location: UK
A recent Nature article suggests that any science using p=.05 as a threshold should replace it with p=.005 immediately.

https://www.nature.com/articles/s41562-017-0189-z
This is what amuses me about statistics! It is its own little circular world where people gravely state "This is statistically significant..." when, in fact, the definition of that is arbitrary and subjective. But that's only half of it.

They now think that the endemic lack of repeatability in 'soft' 'sciences' is due to an overly lax p-value threshold, when it is just as likely to be due to experimental errors, or naivety. If, for example, an experiment purports to demonstrate that people prefer the sound of low bit-rate MP3 over CD, then changing the p-value threshold for significance won't correct the fact that maybe their experiment used only students who have grown up with MP3 and have strong emotional associations with the sound of it. Try to repeat the experiment in 10 years' time and it won't work. The statistical part is just a distraction activity that people ("I didn't know I could do maths, but this is easy!") enjoy, and takes attention away from the fuzzy, non-deterministic system that is human consciousness and culture.
 

oivavoi · Major Contributor · Forum Donor
Joined: Jan 12, 2017 · Messages: 1,721 · Likes: 1,935 · Location: Oslo, Norway
This is what amuses me about statistics! It is its own little circular world where people gravely state "This is statistically significant..." when, in fact, the definition of that is arbitrary and subjective. But that's only half of it.

They now think that the endemic lack of repeatability in 'soft' 'sciences' is due to an overly lax p-value threshold, when it is just as likely to be due to experimental errors, or naivety. If, for example, an experiment purports to demonstrate that people prefer the sound of low bit-rate MP3 over CD, then changing the p-value threshold for significance won't correct the fact that maybe their experiment used only students who have grown up with MP3 and have strong emotional associations with the sound of it. Try to repeat the experiment in 10 years' time or using subjects from a different demographic and it won't work. The statistical part is just a distraction activity that people ("I didn't know I could do maths, but this is easy!") enjoy, and takes attention away from the fuzzy, non-deterministic system that is human consciousness and culture.

Agree to a large extent. In social science (my field), I see correlational statistics as mostly worthless. It can be used descriptively, to give an approximate snapshot of reality, but nothing more. Longitudinal research designs, on the other hand, make me happy deep down in my soul.
 

SoundAndMotion · Active Member
Joined: Mar 23, 2016 · Messages: 144 · Likes: 111 · Location: Germany
Green Jelly Beans Linked to Acne!

(credit goes to XKCD)

[attached image: xkcd "Significant" comic]
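The comic's point is the multiple-comparisons problem: test twenty jelly-bean colours at alpha = 0.05 and the chance of at least one spurious "significant" result is large. A quick sketch (an illustration added here, not from the original post) confirming the number:

```python
import random

random.seed(1)
alpha, n_tests, n_sims = 0.05, 20, 100_000

# Under the null hypothesis a p-value is uniform on [0, 1], so a false
# positive occurs whenever random.random() < alpha.
false_positive_runs = sum(
    any(random.random() < alpha for _ in range(n_tests))
    for _ in range(n_sims)
)
print(false_positive_runs / n_sims)  # ~0.64 by simulation
print(1 - (1 - alpha) ** n_tests)    # ~0.64 analytically
```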
 

Jakob1863 · Addicted to Fun and Learning
Joined: Jul 21, 2016 · Messages: 573 · Likes: 155 · Location: Germany
What's the point of doing scientific listening tests if people then have to do their own non-scientific listening tests to decide? If I am convinced by the method, shouldn't I just accept the results? - it is science after all. Isn't that like Amir doing measurements of DACs but recommending that we all do our own measurements before purchase just to be on the safe side?

As SoundAndMotion already pointed out, it is about what _you_ can hear and, even more important :), whether it is _really_ _important_ to you when listening to music.
I've asked several times (but never got an answer) what people should do if somebody somewhere can detect differences in a controlled listening test. Do they have to buy new equipment "blindly" according to those results, or should they rather try for themselves? (And, as a follow-up question, couldn't they try for themselves anyway?)

And of course one should look at the hypothesis examined in any experiment and evaluate to which group of people the test result can be generalized.

I could be wrong! For me the motivation behind the experiments is more important than the low level details - but most people prefer talking about the low level details. My suspicion is that people are more in love with the methodology and the lovely statistics than having any expectation that it will ever generate anything useful. It may generate lots of lovely tables and histograms that can be published and read by other people interested in the methodology, but that's not the same as something that's useful!

As said earlier, the most important part is learning to listen, especially for evaluation purposes. When doing "blind" listening tests, there is no need to emphasize the "scientific" part; doing things right is important in any case, and it matters less whether you could call it a "scientific" experiment with respect to every detail.
One can get wrong or misleading results from controlled listening tests just as easily as from less controlled "sighted" listening.

Thirty years later, is CD transparent? "Ah well, that depends what you mean by transparent...". OK, is CD audibly the same as high res? "Ah well, you see, it depends on what you mean by audibly the same...". OK, is high res worth it? "Ah well, some meta-analysis suggests that under some circumstances then there may be evidence that it could sound different. More testing is needed...". Etc.!:)

I understand that it might be annoying, but it is important to note that _my_ "transparent or not transparent" might not match _yours_. Stating something like "transparent for _every_ human being" is usually not warranted, and surely not for CD.
 

Jakob1863 · Addicted to Fun and Learning
Joined: Jul 21, 2016 · Messages: 573 · Likes: 155 · Location: Germany
One thing to consider is that there are different outcomes in ABX. In my tests of that type, I strive for almost complete detection. At most I allow a miss due to inattention. Otherwise, I can take the test 100 times and get 100 right answers. Signal theory, or other experimental doubts, doesn't enter into such results. I "classify" A and B first before doing anything with X. Once I am solid on the difference, I can then identify whether X is A or not. That is why I said the outcome here is invariant whether you do AX or ABX, and that performing ABX adds distraction and nothing more.

Now if I had a population of testers with some statistical outcome like p = 0.6, then we would worry about secondary effects. The whole notion of the 0.5 threshold itself has no science backing it. It was an arbitrary confidence level.
But it reflects what theory already suggests, namely that different experimental procedures lead to (more correctly, might or can lead to) different results, which is why I emphasized the role of accommodation/training in another thread and was so surprised by your posts there.
Your and Blumlein 88's method illustrates the difficulties tied to the ABX protocol, and it isn't something a test novice will come up with at a first attempt (at least in my experience, and judging from the posts/publications about it).

In the Fisherian framework the 0.05 isn't a "confidence" level but a level of "significance", and imo it is not completely arbitrary, as it seems to be a reasonable compromise given the reduced control over everything in real-life experiments.
 

Jakob1863 · Addicted to Fun and Learning
Joined: Jul 21, 2016 · Messages: 573 · Likes: 155 · Location: Germany
<snip> While that might not track sensory detection theories, it nearly wipes out false positives. False positives would cause more trouble in audio than false negatives.

I'm interested to learn why false negatives would be less harmful than false positives in audio.

A recent Nature article suggests that any science using p=.05 as a threshold should replace it with p=.005 immediately.

https://www.nature.com/articles/s41562-017-0189-z

More precisely, the authors restricted their proposed requirement to experiments claiming to have found _new_ effects, and furthermore to those following the usual NHST routine.
The publication is an interesting illustration of the very different approaches of frequentists and Bayesians to test concepts. Besides the fact that the authors (surprisingly) use a quite misleading treatment of power, they imo failed to communicate clearly that the main problem is "blindly" using a _routine_ for conclusions/decisions.

Therefore the question is whether it really helps to set the decision criterion at a lower number when people still mechanically/blindly follow the flawed way of decision making.

Using just another (albeit lower) number will at first simply raise the probability of errors of the second kind, as it lowers the power of experiments, but it will not prevent the wrong way of reasoning.

From what I've read, it was neither Fisher's idea that an experimenter should use the same "sacrosanct" decision criterion in every experiment, nor that he or she should proudly claim extraordinary new effects after doing just one experiment with a statistically significant result.
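To illustrate the trade-off being described, here is a small sketch (the 20-trial count and the 70% true detection rate are assumptions for illustration, not numbers from the thread) of how tightening alpha from 0.05 to 0.005 raises the required score and cuts the power of a fixed-length ABX test:

```python
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_k(n: int, alpha: float) -> int:
    """Smallest score whose guessing-tail probability is at or below alpha."""
    for k in range(n + 1):
        if binom_tail(k, n, 0.5) <= alpha:
            return k
    return n + 1

n, p_true = 20, 0.7  # assumed: 20 ABX trials, listener truly correct 70% of the time
for alpha in (0.05, 0.005):
    k = critical_k(n, alpha)
    print(f"alpha={alpha}: need {k}/{n} correct, power ~ {binom_tail(k, n, p_true):.2f}")
```

Under these assumed numbers the stricter criterion pushes the required score from 15/20 to 17/20 and drops the power from about 0.4 to about 0.1, which is the increase in errors of the second kind referred to above.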
 

Thomas savage · Grand Contributor · The Watchman · Forum Donor
Joined: Feb 24, 2016 · Messages: 10,260 · Likes: 16,298 · Location: uk, taunton
This is what amuses me about statistics! It is its own little circular world where people gravely state "This is statistically significant..." when, in fact, the definition of that is arbitrary and subjective. But that's only half of it.

They now think that the endemic lack of repeatability in 'soft' 'sciences' is due to an overly lax p-value threshold, when it is just as likely to be due to experimental errors, or naivety. If, for example, an experiment purports to demonstrate that people prefer the sound of low bit-rate MP3 over CD, then changing the p-value threshold for significance won't correct the fact that maybe their experiment used only students who have grown up with MP3 and have strong emotional associations with the sound of it. Try to repeat the experiment in 10 years' time and it won't work. The statistical part is just a distraction activity that people ("I didn't know I could do maths, but this is easy!") enjoy, and takes attention away from the fuzzy, non-deterministic system that is human consciousness and culture.
I agree 100%.
 

amirm · Founder/Admin · Staff Member · CFO (Chief Fun Officer)
Joined: Feb 13, 2016 · Messages: 44,376 · Likes: 234,559 · Location: Seattle Area