We have had at least one thread on the topic. Read what is said on page 2, especially by J_J in post #27. People speak of a properly done double blind test; post #27 summarizes what is needed. I think very few of us have ever done any such thing.
OK, my fingers are tired. :) I hope others start to contribute and we have a starting point.... I will be super disappointed if this work does not conclude and members don't contribute to it. OP is doing us a great favor by creating this doc so that we can reference in the future than...
So if you aren't doing a proper test, are the results worthless or still worthwhile, and how careful should you be about what you claim? A score of 63 out of 100 means there is only about a 1% chance the result is random rather than real. You need 16 out of 20 to reach the same confidence in a shorter test. The audible difference likely has to be larger to be sure with only 20 trials than with more trials. That means there are some edge cases where you get a null result with 20 trials when the truth is not a null (type 2 errors).
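As a sanity check on those numbers, here is a quick sketch using only Python's standard library, assuming a one-tailed binomial test against chance (p = 0.5). The 70% "true detection rate" in the last part is a hypothetical listener, just to illustrate the type 2 error point:

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of scoring
    k or more correct out of n trials at success rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of scoring this well by pure guessing (p = 0.5):
print(f"63/100 by chance: p = {binom_tail(63, 100):.4f}")  # ~0.006, under 1%
print(f"16/20  by chance: p = {binom_tail(16, 20):.4f}")   # ~0.006 as well

# Type 2 error side: a hypothetical listener who truly hears the
# difference 70% of the time still misses the 16/20 bar on most runs,
# but almost always clears 63/100 with the longer test:
print(f"power at n=20:  {binom_tail(16, 20, 0.7):.2f}")    # ~0.24
print(f"power at n=100: {binom_tail(63, 100, 0.7):.2f}")   # ~0.95
```

So with only 20 trials, a real but modest ability fails to show up roughly three times out of four, which is exactly the edge-case null result described above.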
For myself, I find doing 10 of your usual ABX trials relatively doable without terrible tedium. If I do two of those runs and score 20 of 20, I feel sure I can hear something. But even that is not a "properly done ABX test" by research criteria.
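For scale, the odds of getting a perfect 20-of-20 across two such runs by coin-flipping alone work out to well under one in a million:

```python
# Probability of guessing all 20 trials correctly by chance alone (p = 0.5 each):
p_lucky = 0.5 ** 20
print(f"20/20 by luck: {p_lucky:.2e}")  # about 9.5e-07
```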
PS: I too would like for Thorsten Loesch to give us one example of what he considers a good way to do it. He answered in one post (I think in another thread) in general terms, but people would have a better idea if we had a real example, or one he can imagine meeting his standards. He seems uninterested in providing that.
PPS: This is what Mr. Loesch replied earlier about what is needed.
Why is it so hard to get my position right? I criticise Audio ABX as practiced on a number of grounds, all straightforward and all related to easy-to-understand flaws, including methodology and use of statistics. The result of these flaws is that the Audio ABX test is very heavily weighted towards...
1) Make sure the test is BLIND. That is, the test subjects should not have any awareness of what is being tested, so they cannot have any bias on the subject. This is specific to audio (though it is also useful in other contexts where strong emotions are attached to views on the subject of the investigation), as we have had five decades or so of extreme polarisation.
2) Make sure the test minimises test-induced stress; this involves protocol, environs and general interactions with listeners. They are not enemies to be defeated, but resources to be employed in the search for knowledge. Make the listeners comfortable and relaxed; make them feel they are giving a real contribution, no matter what your personal view on the matter is. If necessary, employ someone to be nice if you cannot be.
3) Use a form of preference / performance ranking; it not only gives more information, but humans are also much more consistent in their preferences than in their ability to correctly identify a specific item. Collect as much data as possible. Use questionnaires that assess the emotional / mental state of the subject as well. Test for reliable preference and reliable alteration of mood/emotional state as proxies for the presence of a potential difference, rather than attempting to test the difference directly.
4) Use whatever statistics you like, but be clear to everyone, your listeners as well as the audience of the test, about what the limitations and implications are UPFRONT.
Is that straightforward and detailed enough?
Thor