> Of course you could use the software tool that way, but if one is only listening to "A" and "X" he isn't doing an ABX test anymore.

Really? So if I do 10 trials and only use A and X, then it isn't an ABX test (according to you). I assume, then, that when I use A, B, and X in each trial it is an ABX test. What if I use A and X in one trial and A, B, and X in the rest? Is this an ABX test? At which point in between does it stop being an ABX test?
> If the same outcome is generated, using the same system, then it is the same test. Regardless of the setup, any ABX test can be run that way. So it cannot be made any different.

Hi Amir, the problem starts at the very beginning. The same outcome is NOT generated by AX and ABX. This was shown by a handful of papers in the '70s and '80s and contributed to the rise of signal detection theory (SDT) over threshold theory in perception studies. SDT, and the experiments, find different outcomes for the two procedures; threshold theory, and you, predict the same outcome. If you are interested, I'll dig up the papers tomorrow. Sorry, can't today.
> Really? So if I do 10 trials and only use A and X, then it isn't an ABX test (according to you). ... At which point in between does it stop being an ABX test?

If your first trial is AX, then it never stops being ABX... because it never was ABX. You can't mix protocols and have a vote of trials to determine who wins the naming election.
> Hi Amir, the problem starts at the very beginning. The same outcome is NOT generated by AX and ABX. ... If you are interested, I'll dig up the papers tomorrow.

Hi. I am interested. So yes, let's see the papers.
> Why do you say some people do better in A/B comparisons than pure difference tests? My limited experience is people feel better and believe they do better. When even a modicum of control is in place, it turns out not to be the case.

I understand what you are saying. But if enough trials are run in double-blind A/B, and if the preference for A or B is statistically persistent for a given subject, can we not infer it is highly likely there is a difference perceived by that subject, without the need to first prove the existence of that difference via ABX or other means?
I think a big confounding issue is people wanting to go straight for preference when they can't or don't demonstrate they can hear a difference.
> Hi. I am interested. So yes, let's see the papers.

That's the same as I've seen for myself. I do the ABX the way you described it.
I have conducted many ABX tests this way and found a detectable difference. It would be interesting to know why that is different than if I had also clicked on "B." What I find is that it can actually reduce my acuity if I try to play A and B constantly; it adds to the distraction of the test and challenges short-term memory.
> I understand what you are saying. But if enough trials are run in double-blind A/B, and if the preference for A or B is statistically persistent for a given subject, can we not infer it is highly likely there is a difference perceived by that subject...?

Yes, if a test shows a persistent difference, even for preference, you don't need to run one for difference.
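To make that inference concrete (the numbers here are hypothetical, not from any post above): if a listener picks A over B in 9 of 10 double-blind trials, the chance of doing at least that well when there is no audible difference (a fair coin) is

P(X >= 9 | n = 10, p = 0.5) = (C(10,9) + C(10,10)) / 2^10 = 11/1024 ≈ 0.011,

so the persistent preference is itself statistically significant evidence of an audible difference.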
> Also, sometimes the difference is obvious. Would you need a formal DBT to first establish a difference between stereo and Mch playback prior to deciding on a preference? If the difference is, duh, obvious, why do an ABX test?

Well, you have to be careful with this. Stereo vs. Mch should get a quick and easy 100% result. Speakers different enough shouldn't be a problem either. Yet time and time again I see audiophiles proclaim the difference is so blindingly obvious that no blind test is needed, about things for which it is highly unlikely there is an audible difference.
A/B is what I have been doing all my life, and I feel much more comfortable with it than ABX. To me, A/B is much more intuitive and natural. I find the concept of identifying, matching, and correlating X with either A or B difficult, non-intuitive, and, yes, time consuming, requiring more replays and switches, adding to test fatigue and adversely affecting results. I also think ABX may have an inherent bias against finding a difference that might be there, though subject training might mitigate that to a degree.
Also, I have frequently concluded in A/B that there is no significant difference, and therefore no preference, even sighted, with all the biases and baggage that entails. I might be wrong in some of those assessments, but I am frequently wrong in ABX too, where there is a forced choice of A or B. Maybe, like Amir, I need to do a lot more learning about how to be a good ABX test taker. But maybe it would be much more fun to just listen to a lot more music instead of conducting ABX training sessions on myself.
And what do we make of Toole's ABCD speaker tests at Harman, which are not ABX? Yes, with speakers it might be reasonable to accept that no two speakers are likely to sound exactly the same; usually this is obvious to even a casual listener, so a difference can be assumed as sufficiently likely. But note that test subjects are not required to painstakingly figure out whether A, B, C, or D is playing now, as with ABX. They simply indicate their preference, whichever it is, relative to the other choices.
Personally, I am much more interested in the question of preference, as long as there is some way to reasonably and objectively infer that there is a difference. Agreed, some people might not do a good enough job of establishing a perceivable difference beyond reasonable doubt before jumping to preference. ABX testing, which is only for difference, may be useful in some specific cases, but I think it sidesteps the more important, more useful question of preference. Although, as we also know, sometimes there is a difference but no clear preference.
> Another thing to wonder about is that duo-trio, triangle and AFC testing is used mostly for tastes and smells. With audio you have the issue of echoic memory. ...

Dead on.
From Macmillan, N. A., & Creelman, C. D. (2005). Detection Theory: A User's Guide. Mahwah, NJ: Lawrence Erlbaum Associates, p. 235:
A comparison of Equations 9.5 and 9.13 reveals that, according to threshold theory, the value predicted for proportion correct in same-different is exactly the same as in ABX. Experiments that have compared the two paradigms (Creelman & Macmillan, 1979; Pastore, Friedman, & Baffuto, 1976; Rosner, 1984) generally have not supported this prediction, but have instead found p(c) to be higher in ABX, consistent with SDT analysis.
Creelman, C. D., & Macmillan, N. A. (1979). Auditory phase and frequency discrimination: A comparison of nine paradigms. Journal of Experimental Psychology: Human Perception and Performance, 5,146-156.
Pastore, R. E., Friedman, C. J., & Baffuto, K. J. (1976). A comparative evaluation of the AX and two ABX procedures. Journal of the Acoustical Society of America, 60, S120 (Abstract).
Rosner, B. S. (1984). Perception of voice-onset-time continua: A signal detection analysis. Journal of the Acoustical Society of America, 75, 1231-1242.
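For anyone who wants to see the quoted prediction in action, here is a rough Monte Carlo sketch. It is not taken from the book: the d' value, trial counts, and "differencing" decision rules are assumptions on my part. With sensitivity held fixed, proportion correct comes out higher for the ABX rule than for same-different (AX), matching what the cited experiments found.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1.5, 200_000   # assumed sensitivity (d-prime) and trial count

# ABX: A and B differ by d; X matches one of them at random.
# Differencing rule: call X whichever of A, B it lies closer to.
a = rng.normal(0.0, 1.0, n)
b = rng.normal(d, 1.0, n)
x_is_a = rng.random(n) < 0.5
x = np.where(x_is_a, rng.normal(0.0, 1.0, n), rng.normal(d, 1.0, n))
p_abx = np.mean((np.abs(x - a) < np.abs(x - b)) == x_is_a)

# Same-different (AX): respond "different" when |x1 - x2| exceeds a
# criterion; sweep criteria and keep the best, approximating an
# optimally placed criterion.
same = rng.random(n) < 0.5
x1 = rng.normal(0.0, 1.0, n)
x2 = np.where(same, rng.normal(0.0, 1.0, n), rng.normal(d, 1.0, n))
gap = np.abs(x1 - x2)
p_ax = max(np.mean((gap > c) == ~same) for c in np.linspace(0.1, 4.0, 40))

print(f"ABX p(c) ~ {p_abx:.3f}   same-different p(c) ~ {p_ax:.3f}")
```

The criterion sweep stands in for an optimally placed same-different criterion; the closed-form SDT expressions the book refers to (Equations 9.5 and 9.13) make the same comparison analytically.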
If you consistently find a difference, this doesn't throw the reality of the difference into doubt. What is at stake is that one method may fail to detect a difference that another method does detect. Some of what Jakob has quoted talks about food taste testing. A triangle test failed to find a difference, while a 3AFC test, where you are asked to find the sweetest of the samples (sweetness is just an example; I forget what the actual attribute was), succeeded in showing which was sweetest. So if a test of purely same-or-different fails while a test for the direction of the difference finds a statistically valid result, something differs between the methods; you would naively have expected both tests to work.
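To illustrate why that can happen, here is a hypothetical sketch; the d' value and decision rules are my assumptions, not details from the actual food study. With the same underlying sensitivity, a "pick the sweetest" 3AFC rule yields a noticeably higher proportion correct than the triangle test's "pick the odd one" rule, so a triangle test can fail to reach significance where 3AFC succeeds.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 1.0, 200_000   # assumed sensory difference and trial count

std = rng.normal(0.0, 1.0, (n, 2))   # two "standard" samples
tgt = rng.normal(d, 1.0, (n, 1))     # one stronger (e.g. sweeter) sample
obs = np.concatenate([std, tgt], axis=1)   # target sits in column 2

# 3AFC: "pick the sweetest" -- choose the strongest observation.
p_3afc = np.mean(obs.argmax(axis=1) == 2)

# Triangle: "pick the odd one" -- the sample left out of the closest pair.
pairs = [(0, 1, 2), (0, 2, 1), (1, 2, 0)]          # (pair i, pair j, odd k)
gaps = np.stack([np.abs(obs[:, i] - obs[:, j]) for i, j, _ in pairs], axis=1)
odd = np.array([k for _, _, k in pairs])[gaps.argmin(axis=1)]
p_tri = np.mean(odd == 2)

print(f"3AFC p(c) ~ {p_3afc:.3f}   triangle p(c) ~ {p_tri:.3f}   chance = 1/3")
```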
> What improvements can we hope to hear in our systems as a result of these listening tests of various durations?

If someone does a listening test comparing the hypothetical Widget 100 Black to a similar item in your system, showing an improvement using a method that convinces you, and the price of the Widget is acceptable to you, then you might compare it yourself and buy it if you think it's worth it. Voila, improvement.
Another thing to wonder about is that duo-trio, triangle and AFC testing is used mostly for tastes and smells.
With audio you have the issue of echoic memory.
The difference in discrimination using ABX testing is vast when you use 4-second or shorter samples with instant switching versus longer samples. Even 15-second samples will dramatically reduce your ability to detect actual differences. Yet if I were doing preference testing, I don't think I can arrive at a preference in less than 15 seconds, and I don't really feel it for samples of less than 30 seconds. Are my feelings correct, however?
Myself, if you are testing for preference, I prefer the 2AFC or 3AFC method, for two reasons. First, there is no cognitive load to detect a difference: you know the two samples differ in some way; you are not given the same thing twice. Second, it is statistically easier: if someone scores 75% correct choices over enough trials, the result is 5% or less likely to be random. (A quick check of that claim follows below.)
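As a sanity check on that 75% figure (the trial counts below are my own illustrative choices, not from the post): the one-sided binomial probability of a pure 50/50 guesser scoring at least 75% in 2AFC drops below 5% once the run reaches roughly 16 trials.

```python
from math import ceil
from scipy.stats import binom

# How unlikely is scoring 75%+ in 2AFC by pure guessing (p = 0.5)?
for n in (10, 16, 20, 40):
    k = ceil(0.75 * n)            # smallest count giving >= 75% correct
    p = binom.sf(k - 1, n, 0.5)   # P(X >= k) for a 50/50 guesser
    print(f"{k}/{n} correct: p = {p:.4f}")
```

At 10 trials the guesser still sneaks past 5% (p ≈ 0.055), so the 75% rule of thumb needs a long enough run to hold.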
> If someone does a listening test comparing the hypothetical Widget 100 Black to a similar item in your system, showing an improvement using a method that convinces you... then you might compare it yourself and buy it if you think it's worth it. Voila, improvement.

What's the point of doing scientific listening tests if people then have to do their own non-scientific listening tests to decide? If I am convinced by the method, shouldn't I just accept the results? It is science, after all. Isn't that like Amir doing measurements of DACs but recommending that we all do our own measurements before purchase, just to be on the safe side?
> But wait, aren't you a listening test sceptic? That's fine, but if you can describe what method of listening test would convince you, QED. If there is no such thing as convincing you, isn't your question disingenuous?

I could be wrong! For me the motivation behind the experiments is more important than the low-level details, but most people prefer talking about the low-level details. My suspicion is that people are more in love with the methodology and the lovely statistics than having any expectation that it will ever generate anything useful. It may generate lots of lovely tables and histograms that can be published and read by other people interested in the methodology, but that's not the same as something that's useful!
> What's the point of doing scientific listening tests if people then have to do their own non-scientific listening tests to decide? If I am convinced by the method, shouldn't I just accept the results? ...

Think in terms of convincing vs. not convincing, not scientific vs. non-scientific. You define the former for yourself, and no one can say you are wrong... although they may try to help with info. The latter has no universally accepted definition. The reason you would repeat it for yourself is inter-subject variability and individual value judgements. If someone convinces you that a difference is audible, can you hear it? If you can definitely hear it, is it worth the money? Listening yourself is of course optional... it's your money.
> I could be wrong! For me the motivation behind the experiments is more important than the low-level details, but most people prefer talking about the low-level details. ...

I completely agree that the motivation is also important. It really should be stated (and usually is in a peer-reviewed article). But the low-level details relate to the convincingness. The reason so many discuss the details is, I suspect, not so much a love of details as people trying to say either "you should be convinced by what I say" (because of these details) or "I am not convinced" (because of these details). Say that 7 heads out of 10 coin tosses does not convince me the coin is rigged, but that at 700 heads out of 1000 tosses I'll cry foul: many may complain I'm not making sense, and I can only clarify using low-level details.
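To put rough numbers on that coin-toss example (a quick sketch of my own; the scipy calls, not the post, are the assumption here):

```python
from scipy.stats import binom

# One-sided p-values under a fair coin (p = 0.5).
print(binom.sf(6, 10, 0.5))      # P(X >= 7 of 10):    ~0.17, unremarkable
print(binom.sf(699, 1000, 0.5))  # P(X >= 700 of 1000): vanishingly small
```

The same 70% hit rate is entirely ordinary at 10 tosses and essentially impossible at 1000; that is the low-level detail doing the convincing.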
> Thirty years later, is CD transparent? "Ah well, that depends what you mean by transparent...". OK, is CD audibly the same as high res? "Ah well, you see, it depends on what you mean by audibly the same...". OK, is high res worth it? "Ah well, some meta-analysis suggests that under some circumstances there may be evidence that it could sound different. More testing is needed...". Etc.!

I follow everything and take your points well, except "is high res worth it?" This is, again, inter-subject variability and individual value judgements. If you are not convinced, save your money!