Listening Test Duration

Wombat · Jan 16, 2018

FYI: https://www.researchgate.net/public...f_Test_Duration_in_Subjective_Listening_Tests

amirm · Jan 16, 2018

Interesting read. It is a complex area that is not researched well. When testing small differences, it is very common for people to give up and vote randomly just to get through it all.

Jakob1863 · Jan 16, 2018

It is another variable that has to be considered in planning a listening experiment.
Especially as there is evidence from several studies that longer music sample excerpts are associated with better discrimination ("longer" means in this case longer than the often used 5 -7 s excerpts) which obviously leads to longer test durations everything else remaining unchanged.

Even more pronounced this problem gets if an experimenter additionally considers the emotinal response which might evolve over more extended length of musical excerpts, because that can be in the range of minutes.

Therefore it seems to be a good advice to combine quantitive methods (i.e. the conventional listening test approach) with qualitative methods, where effects that are only assessed with difficulties within the usual tests, might be revealed.

oivavoi · Jan 16, 2018

Jakob1863 said:
Therefore it seems to be a good advice to combine quantitive methods (i.e. the conventional listening test approach) with qualitative methods, where effects that are only assessed with difficulties within the usual tests, might be revealed.

Do you know of any qualitative studies into audio perception, Jakob? I think I read one some time ago, but forgot to save it for later.

Jakob1863 · Jan 16, 2018

There were at least several attempts to use a combination of qualitative and quantitative methods over the years. I´ll cite/link some later or tomorrow.....

Nyberg wrote a thesis about :

Dan Nyberg, An Investigation of Qualitative Research Methodology for Perceptual Audio Evaluation :

https://www.diva-portal.org/smash/get/diva2:990443/FULLTEXT01.pdf

astr0b0y · Jan 16, 2018

This is interesting to me. For myself, I find my mood and general state of mind has an enourmous effect on my perception of sound quality.

oivavoi · Jan 16, 2018

Jakob1863 said:
There were at least several attempts to use a combination of qualitative and quantitative methods over the years. I´ll cite/link some later or tomorrow.....

Nyberg wrote a thesis about :

Dan Nyberg, An Investigation of Qualitative Research Methodology for Perceptual Audio Evaluation :

https://www.diva-portal.org/smash/get/diva2:990443/FULLTEXT01.pdf

Thanks! I think a paper by Nyberg actually was the one I remembered reading!

Jakob1863 · Jan 16, 2018

oivavoi said:
Thanks! I think a paper by Nyberg actually was the one I remembered reading!

You´re welcome.

Jan Berg also wrote some papers and introduced a software tool called OPAQUE.
His reference lists include several publications from Choisel/Wickelmaier and Rumsey, Zielinski as well, that i had in mind too.

amirm · Jan 16, 2018

astr0b0y said:
This is interesting to me. For myself, I find my mood and general state of mind has an enourmous effect on my perception of sound quality.

Definitely. When I am challenged, I am the best at finding and hearing small differences.

If not, it is hard to get motivated to do the same.

sergeauckland · Jan 16, 2018

amirm said:
Interesting read. It is a complex area that is not researched well. When testing small differences, it is very common for people to give up and vote randomly just to get through it all.

To me, that says that there are no differences, or so small as to be negligible. Maybe it's due to my age, or jaded by getting on for 50 years in audio engineering, but I can no longer get excited by small trivial differences in audio. So when I do listening tests, and I still do them as needed, if the difference isn't obvious, if I have to listen carefully going back and forth to decide on something, it's not worth the effort. As far as ;'m concerned, then , there IS no difference. Or at least, not one worth bothering about.

I've not been a fan of ABX testing. I much prefer AA, AB, BB, BA testing where the choice is same/different. I've found that to be very good for finding out whether a difference exists at all. If no difference exists, then any discussion of which is better becomes meaningless. If a difference exists, which can very easily be found from a statistical analysis of the responses, then one can start looking for what sort of differences, and expressing preferences.
Far too many tests, I think, try and decide on which is better before even deciding on whether they're different at all.
S.

amirm · Jan 16, 2018

sergeauckland said:
To me, that says that there are no differences, or so small as to be negligible. Maybe it's due to my age, or jaded by getting on for 50 years in audio engineering, but I can no longer get excited by small trivial differences in audio. So when I do listening tests, and I still do them as needed, if the difference isn't obvious, if I have to listen carefully going back and forth to decide on something, it's not worth the effort. As far as ;'m concerned, then , there IS no difference. Or at least, not one worth bothering about.

It certainly indicates that for the population at large. I do the testing to counter the "it can't be audible." Folks say that without understanding the technology, nor bothering to ever test their assumptions. So I do the test to show that when I say something can be audible, it indeed can be. And that there is a lien we don't want to cross and still say the system is transparent.

But yes, seeing how few to no one goes and replicates my results, and having tested so many others in controlled testing, that my successes don't indicate even the audiophiles can hear them.

Jakob1863 · Jan 16, 2018

sergeauckland said:
To me, that says that there are no differences, or so small as to be negligible. Maybe it's due to my age, or jaded by getting on for 50 years in audio engineering, but I can no longer get excited by small trivial differences in audio. So when I do listening tests, and I still do them as needed, if the difference isn't obvious, if I have to listen carefully going back and forth to decide on something, it's not worth the effort. As far as ;'m concerned, then , there IS no difference. Or at least, not one worth bothering about.

I've not been a fan of ABX testing. I much prefer AA, AB, BB, BA testing where the choice is same/different. I've found that to be very good for finding out whether a difference exists at all. If no difference exists, then any discussion of which is better becomes meaningless. If a difference exists, which can very easily be found from a statistical analysis of the responses, then one can start looking for what sort of differences, and expressing preferences.
Far too many tests, I think, try and decide on which is better before even deciding on whether they're different at all.
S.

The interesting part is that a "same/different" test seems to be even more challenging for participants than an ABX test, not subjectively but wrt the false response proportion, which is usually astonishingly high in the "same trials" (i.e. AA and BB) being usually around 70 - 80 % .

Although is seems to be plausible to listen first for a difference and then for a possible preference (and is usually done in most cases in this order in industrial consumer tests ), in practice most listeners seem to be instantly "looking" for a preference when comparing two devices.
That is in my experience the reason that most people seem to do better in A/B comparisons than in pure difference tests, although they usually still need some time for accommodation (avoiding the somewhat misleading term "training") .

Blumlein 88 · Jan 17, 2018

Jakob1863 said:
The interesting part is that a "same/different" test seems to be even more challenging for participants than an ABX test, not subjectively but wrt the false response proportion, which is usually astonishingly high in the "same trials" (i.e. AA and BB) being usually around 70 - 80 % .

Although is seems to be plausible to listen first for a difference and then for a possible preference (and is usually done in most cases in this order in industrial consumer tests ), in practice most listeners seem to be instantly "looking" for a preference when comparing two devices.
That is in my experience the reason that most people seem to do better in A/B comparisons than in pure difference tests, although they usually still need some time for accommodation (avoiding the somewhat misleading term "training") .

Why do you say some people do better in A/B comparisons than pure difference tests? My limited experiences is people feel better and believe they do better. When even a modicum of control is in place turns out not to be the case.

I think a big confounding issue is people wanting to go straight for preference when they can't or don't demonstrate they can hear a difference.

sergeauckland · Jan 17, 2018

amirm said:
It certainly indicates that for the population at large. I do the testing to counter the "it can't be audible." Folks say that without understanding the technology, nor bothering to ever test their assumptions. So I do the test to show that when I say something can be audible, it indeed can be. And that there is a line we don't want to cross and still say the system is transparent.

But yes, seeing how few to no one goes and replicates my results, and having tested so many others in controlled testing, that my successes don't indicate even the audiophiles can hear them.

Measurements will always show a difference. Even two samples of the same product will have differences (hopefully small) but differences nevertheless if one looks closely enough. The issue then becomes one of understanding what the thresholds for audibility are. Back in the mists of time, it was discovered that generally, 1% THD was the point at which most listeners found distortion to be inaudible. This of course depends on the harmonic structure, 1% 7th harmonic would be a lot more audible than 1% 2nd harmonic, but it would take an odd piece to equipment to generate 7th harmonic only. Leaving that aside, for normal 2nd,3rd harmonic content 1% was about right for most listeners on speech or music. Consequently, on the basis that measuring equipment ought to be at least one order of magnitude better than what's being measured, it was felt that if an amplifier had less than 0.1% THD, then that was completely inaudible under all circumstances.
Of course there may be the odd lucky (or cursed) person that could hear 0.1%, but it could be taken as statistically insignificant.

Now, marketing got in the way, and some manufacturers started using the 0.1% as some sort of slogan, even though they conveniently missed the point that their amps could do 10 watts and 0.1% distortion, but not at the same time, and not into less than a 16 ohm load, and only at 700Hz!

Now, measuring two amplifiers, one with 0.01% THD and one with 0.001%THD, all other things being equal, is it sensible to think that they could sound different?

Measuring two samples of the same product, one could have 0.001112% THD, the other 0.00113% THD. They are different, but is it in any way audible?

I bang on about transparency, and straight-wire bypass tests because my feeling is that it's audibility that is important, and relating measurements to that is why I do measurements, to satisfy myself that what I'm using is either transparent, or I know what its limitations are.
S.

Jakob1863 · Jan 17, 2018

Blumlein 88 said:
Why do you say some people do better in A/B comparisons than pure difference tests?

Because that it is what the data suggests.

My limited experiences is people feel better and believe they do better. When even a modicum of control is in place turns out not to be the case.

To have participants in a test feel comfortable and confident basically isn´t a bad thing. But wrt "modicum of control" i´m talking about results of controlled sensory experiments.
As said before, people not used to do controlled tests still need some accommodation time even if the test protocol does fit better to their usual/normal routine.

I´ve cited some results from comparion of ABX to other protocols (like A/B) but the paired comparison used as a "same/different" is a more difficult one. I´ve cited already earlier a crosscultural study comparing the results when using this "same/different" protocol. The proportion of false responses in the "same trials" is a bit different between different countries (afair one of the asian countries showed the highest "miss rate" ) but is surprisingly large in all countries and is robust over different product categories. The proportion of false responses is usually somewhere between 70 - 80% when evaluation the same stimulus in listening tests, in food tests and even in tests of cigarettes (wrt certain features).

This represents a problem in the traditional statistical analysis but offers the possibility of a more modern analysis where the results of the "same trial" establish a socalled identicallity norm to which the results of the "different trials" can be compared. But that is up to now not trickled down to the normal test routines.

I think a big confounding issue is people wanting to go straight for preference when they can't or don't demonstrate they can hear a difference.

In fact - as outlined above - the big confounding issue is the presentation of "same trials" in tests. As said before, in my experience people are always instantly evaluating if they prefer something when comparing "things", they apparently don´t care so much about a processing order where they should first find a difference and than a preference.
If an established preference exists it means that a difference must exist; in the converse it must not be so.

Of course there exist different models of the internal evalution/judgement processes of humans and which one approximate better a specific situation varies. Researchers are often amazed about the differences between model predictions and real world results.

Blumlein 88 · Jan 17, 2018

Jakob1863 said:
Because that it is what the data suggests.

To have participants in a test feel comfortable and confident basically isn´t a bad thing. But wrt "modicum of control" i´m talking about results of controlled sensory experiments.
As said before, people not used to do controlled tests still need some accommodation time even if the test protocol does fit better to their usual/normal routine.

I´ve cited some results from comparion of ABX to other protocols (like A/B) but the paired comparison used as a "same/different" is a more difficult one. I´ve cited already earlier a crosscultural study comparing the results when using this "same/different" protocol. The proportion of false responses in the "same trials" is a bit different between different countries (afair one of the asian countries showed the highest "miss rate" ) but is surprisingly large in all countries and is robust over different product categories. The proportion of false responses is usually somewhere between 70 - 80% when evaluation the same stimulus in listening tests, in food tests and even in tests of cigarettes (wrt certain features).

This represents a problem in the traditional statistical analysis but offers the possibility of a more modern analysis where the results of the "same trial" establish a socalled identicallity norm to which the results of the "different trials" can be compared. But that is up to now not trickled down to the normal test routines.

In fact - as outlined above - the big confounding issue is the presentation of "same trials" in tests. As said before, in my experience people are always instantly evaluating if they prefer something when comparing "things", they apparently don´t care so much about a processing order where they should first find a difference and than a preference.
If an established preference exists it means that a difference must exist; in the converse it must not be so.

Of course there exist different models of the internal evalution/judgement processes of humans and which one approximate better a specific situation varies. Researchers are often amazed about the differences between model predictions and real world results.

What about triangle tests? Or duo-trio tests?

I personally like the triangle tests.

Jakob1863 · Jan 17, 2018

Blumlein 88 said:
What about triangle tests? Or duo-trio tests?

I personally like the triangle tests.

Difficult to answer although both offer the advantage of fewer trials needed because probability to be correct by chance is only 1/3.
But a closer look shows that the proportion of correct answers differ between the methods, although intuitively one tend to think that the methods (ABX, triangle and duo-trio) are quite similar and therefore results should be nearly identical.

Quite early there was the socalled "discriminatory nondiscriminators paradoxon" observed, where participants weren´t very good at sorting the odd one out (triangle test) while answering (in the same test trial) the question for the weakest or strongest probe more often correctly.
Frijters offered an explanation in his paper from 1979 and showed that the paradoxon could be resolved by using psychometrical reformulation and showed that each case was differently assessed by different models of the internal evaluation processes. The different instructions for the tasks lead to drastically different internal decision processes. (all said in short therefore neglecting the specific reasoning).
Later in other publications it was shown that additionally there exists a mix with a presentation order effect, therefore results were different depending on strongest or weakest samples presented first in the trials.

Therefore, as said already in another post, it is imo most important to find a test method that fits the individual abilities of a/the participant(s) especially if the number of test subjects is small. Further it is important to check by using positive controls (and negative as well, although for other reasons) if a sufficient sensitivity of a participant(s) under the specific test conditions is reached.

J.E.R.Frijters, The paradox of discriminatory nondiscriminators resolved, Chemical Senses and Flavour, Volume 4, Number 4, 1979, 355

danadam · Jan 17, 2018

sergeauckland said:
I've not been a fan of ABX testing. I much prefer AA, AB, BB, BA testing where the choice is same/different.

Not sure I understand the difference. If you listen to only A and X in ABX test and answer same/different as "X is A"/"X is B", isn't that the same as your AA, AB, ...?

Jakob1863 · Jan 17, 2018

danadam said:
Not sure I understand the difference. If you listen to only A and X in ABX test and answer same/different as "X is A"/"X is B", isn't that the same as your AA, AB, ...?

Could be but then it isn´t an ABX test anymore.

amirm · Jan 17, 2018

Jakob1863 said:
Could be but then it isn´t an ABX test anymore.

It is actually because you can take the ABX test that way even if B is presented.

Listening Test Duration

Master Contributor

Founder/Admin

Addicted to Fun and Learning

Major Contributor

Addicted to Fun and Learning

Active Member

Major Contributor

Addicted to Fun and Learning

Founder/Admin

Major Contributor

Founder/Admin

Addicted to Fun and Learning

Grand Contributor

Major Contributor

Addicted to Fun and Learning

Grand Contributor

Addicted to Fun and Learning

Addicted to Fun and Learning

Addicted to Fun and Learning

Founder/Admin

Similar threads