No, it means you got a specific number of correct/incorrect trials in the test and then a specific set of statistical analysis was applied to draw a conclusion.
The conclusion has two confidence numbers attached, namely the risk of false positives and the risk of false negatives.
This, like so many things, is a triangle where any two sides exclude the third:
small number of trials (fewer than ~100)
low risk of false positives
low risk of false negatives
Choose any two. Accept that the third item will be out of the picture.
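That trade-off can be sketched with an exact binomial calculation (Python, standard library only). The 70% "true hit rate" below is just an assumed modest real effect, not anything from a real test:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def risks(n, true_rate=0.7, alpha=0.05):
    """Decision threshold that keeps the false-positive risk under alpha,
    plus the false-negative risk at that threshold if the listener's real
    hit rate is true_rate (0.7 here is an assumed modest effect)."""
    # smallest score k whose chance under pure guessing (p = 0.5) is <= alpha
    k = next(k for k in range(n + 1) if binom_tail(n, k, 0.5) <= alpha)
    beta = 1 - binom_tail(n, k, true_rate)  # chance of missing the real effect
    return k, beta

for n in (10, 20, 50, 100):
    k, beta = risks(n)
    print(f"{n:3d} trials: {k} correct needed, false-negative risk ~ {beta:.2f}")
```

Holding false positives at 5%, the false-negative risk only falls to low levels once the trial count gets up toward 100 — which is the "choose any two" in numbers.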
I already posted that, in the other thread.
Thor
I get, in a sense, what you are talking about. For instance, you think you hear a difference, you do a blind test, and get 12 of 20. Nowhere close to only a 5% chance your result was random. Do this five times, and with 60 of 100 you are within the range to say there is less than a 5% chance the result was random. There are sure to be some edge cases where you miss something.
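Those two numbers are easy to check with an exact one-sided binomial test (a quick sketch, standard library only):

```python
from math import comb

def p_value(correct, trials):
    """Chance of scoring at least this well by pure guessing
    (one-sided exact binomial test against p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

print(p_value(12, 20))   # ~0.25: nowhere near significant
print(p_value(60, 100))  # ~0.028: below the usual 5% cutoff
```

Same 60% hit rate both times; only the trial count changes the conclusion.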
What is often claimed is some highly dubious possibility, and the description is "night and day", "you must be deaf if you don't hear this", and the ever popular "my wife heard it from another room". Or even, "the difference is so huge I don't have to match levels". Those guys should not need 100 trials. Yes, maybe in some cases there is something that would show up in 100 trials, but if it goes against rational, reasonable knowledge of how things work, it is very unlikely. At a minimum, something that needs 100 trials to reach 60 correct is not a "night and day" difference.
Having done a few amateur tests, it has always astounded me how something can seem so certain, so clear, to the point I'd feel like it was blindingly obvious (pun intended), only to have it disappear completely with blinding. I have over time done some of those with recorded digital files out to 100 trials. The more I did, the closer my results got to exactly 50/50. As real as it seemed, there was nothing to it. Of course I've also detected real differences, so the test certainly can work.
Given the claims made, experience with such things, and how things work, you could still hold out hope that some of these improbable claims are real. I think the odds are very low. No one can say until such testing is done, I suppose.
On the other hand, if it takes 60 of 100 to finally tease out a result, then just how important is the difference for the enjoyment of everyday music playback? 50/50 vs 60/40 must be getting down to trivial levels.
I like the way 2AFC testing works, but in audio the listeners always complain about not having a choice that says it sounds the same. No matter how I explain why it will still work, and that people have scored beyond chance even when they thought they heard no difference, they are never satisfied. I also like triangle testing: here are three samples, two are the same and one is different; listen however you like and pick the odd one out. Up/down testing seems useful for determining thresholds. You do have to have a characteristic you know is audible and can be varied, however. There are some online blind tests where you can find your personal threshold for THD and IMD. Up/down testing seems cognitively easier to me than the other types, I suppose because you start with something audible and get feedback on the results.
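For the up/down idea, here is a minimal sketch of a common 2-down/1-up staircase run against a toy simulated listener. Everything in it — the -60 dB threshold, the 2 dB step, the crude hears-it-or-guesses listener model — is an assumption for illustration, not any real test procedure:

```python
import random

def simulated_listener(level_db, threshold_db=-60.0):
    """Toy listener: always hears the artifact above an assumed -60 dB
    threshold, guesses (50/50) below it."""
    return level_db > threshold_db or random.random() < 0.5

def staircase(start_db=-30.0, step_db=2.0, reversals_wanted=8):
    """2-down/1-up staircase: two correct in a row -> lower the level
    (harder); one wrong -> raise it (easier). Converges near the 70.7%
    point of the listener's psychometric curve; threshold is estimated
    as the average level at the direction reversals."""
    level, correct_run, last_dir = start_db, 0, None
    reversals = []
    while len(reversals) < reversals_wanted:
        if simulated_listener(level):
            correct_run += 1
            if correct_run < 2:
                continue            # first correct: present same level again
            correct_run, direction = 0, -1   # two in a row: make it harder
        else:
            correct_run, direction = 0, +1   # wrong: make it easier
        if last_dir is not None and direction != last_dir:
            reversals.append(level)          # direction changed: a reversal
        last_dir = direction
        level += direction * step_db
    return sum(reversals) / len(reversals)

random.seed(0)  # reproducible demo run
print(f"estimated threshold: {staircase():.1f} dB (toy listener's true value: -60 dB)")
```

The appeal the post describes shows up in the code: the run starts well above threshold, so the listener begins with clearly audible trials and the procedure walks down from there.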
I don't agree that ABX is some evil designed to return a null result, glossing over real differences. Like anything, it can be misused. As with much testing, the temptation to use too few trials or samples is there. With few trials it is better at confirming differences than at ruling them out; to rule them out you just need more trials. If you can do 10 of 10, and especially 20 of 20, you likely don't need to continue.
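That asymmetry is plain from the exact binomial arithmetic (a sketch; `guess_chance` is just a throwaway helper name):

```python
from math import comb

def guess_chance(correct, trials):
    """Chance of doing at least this well by pure guessing (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

print(guess_chance(10, 10))  # ~0.001: a perfect short run is already convincing
print(guess_chance(20, 20))  # ~0.000001: overwhelming
print(guess_chance(6, 10))   # ~0.38: a null-ish short run rules out nothing
```

A clean 10/10 or 20/20 ends the question, but a mediocre score over 10 trials is compatible with both guessing and a modest real effect — which is exactly why ruling differences out takes many more trials than confirming them.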