Since in any such “forced choice” test the listener can simply guess, a statistical protocol is used to separate guessing from a genuine fidelity difference. The standard criterion is that the probability of the observed score arising from random guessing must be below 5%. Put inversely, we require 95% confidence that the person heard an audible difference between A and B.
The headline from the report was that across all trials, the rate of correct answers was only 49.82% (why anyone would report such a number to two decimal places, when the margin of error is far larger than that precision, is beyond me). The good-enough camp happily runs with this summary, declaring that listeners did no better than chance at telling DVD-A/SACD from its “CD” version, ergo there is no audible difference.
Lost in that is one tester who managed to get 8 out of 10 right, meaning there was a 94.5% probability that he was identifying the proper source rather than guessing. That is so close to the 95% threshold that it should have been noted as significant and as a caveat to the larger conclusion, but it was not. Two other testers managed 7 out of 10 correct selections. All of these were dismissed as exceptions, and the aggregate number of trials/listeners was incorrectly relied upon instead.
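For reference, those confidence figures fall straight out of the binomial distribution: the p-value for a score is the chance of doing that well or better by pure guessing. A quick sketch (assuming a fair 50/50 guess per trial):

```python
from math import comb

def p_at_least(k, n, p=0.5):
    """Probability of scoring k or more correct out of n by pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 8 of 10 correct: chance of doing that well (or better) by guessing
print(f"P(>=8/10 by chance) = {p_at_least(8, 10):.4f}")  # 0.0547, i.e. ~94.5% confidence
# 7 of 10 correct: nowhere near the 5% threshold
print(f"P(>=7/10 by chance) = {p_at_least(7, 10):.4f}")  # 0.1719
```

So 8/10 just misses the conventional cutoff, while 7/10 is entirely unremarkable.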
I think this is a common misunderstanding of what a 95% confidence interval means. Suppose we want to know whether a coin has a preference as to which side it lands on, and we give a thousand people a shot at 20 coin flips each. If one person gets 19 out of 20 heads, it does not mean we're 95% sure the coin had a preference for heads for that one person. The 95% confidence interval is there to let us make inferences about the broader population, about where the mean lies. You can't really pick out one person from the group; if you do, you'll have to test them multiple times to have confidence the result is repeatable.
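To put a number on why singling out the high scorer in a group misleads: assuming, purely for illustration, a panel of 20 listeners who are all guessing, the chance that at least one of them clears the 8-of-10 mark is substantial. The panel size here is hypothetical, not from the report.

```python
from math import comb

def p_at_least(k, n):
    """Chance of k or more correct out of n by guessing (fair 50/50)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

p_single = p_at_least(8, 10)          # one guesser clears 8/10: ~5.5%
panel = 20                            # hypothetical number of listeners
p_any = 1 - (1 - p_single) ** panel   # at least one "significant" guesser in the panel
print(f"one guesser: {p_single:.3f}; at least one of {panel}: {p_any:.3f}")
```

With twenty guessers, odds are better than two in three that somebody posts a nominally "significant" score by luck alone.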
An easy way to see why a 95% result doesn't mean 95% confidence: you can get 100% purely by chance. It's rare, but it happens. If someone got 100% of their coin tosses while "willing" heads, would you say you're 100% confident they could will the outcome?
As for "7 out of 10", note that the chance distribution is a bell curve, so 7 of 10 is uninteresting; it's buried in the meat of the curve. I just tossed a penny in 5 sets of 10 and got 6/4, 6/4, 5/5, 2/8, 3/7. There's an 80% and a 70% in there, purely by chance and with few attempts. The total came to 56% tails over 50 tosses, and we know this would tend toward 50% with more tosses, yet I already have 80% on an individual set, and there's even some chance I'd have gotten 100%. The only thing that can matter is the long-run outcome.
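The penny experiment matches what the binomial distribution predicts: a 7/3 split, in either direction, is routine for a fair coin. A quick exact calculation:

```python
from math import comb

# Chance that a fair set of 10 tosses comes out at least as lopsided as 7/3,
# counting both directions (7+ heads or 7+ tails)
lopsided = 2 * sum(comb(10, i) for i in range(7, 11)) / 2**10
print(f"P(7/3 or more lopsided) = {lopsided:.3f}")  # 0.344
```

Roughly one in three sets of ten fair tosses will look at least that skewed, which is why a 7-of-10 score tells you essentially nothing.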
But to your greater point, I think I agree that simply doing tests like ABX with arbitrary people doesn't always answer the question we have in mind. For instance, suppose the question is, "can a person discern trumpet notes in a piece of music with many instruments?", and we have subjects from the general population hear two clips of music and choose A or B as the clip with trumpet. Everyone here would probably get 100%, but a typical person might have to guess at every one. As a whole they'd get 50%, but some would be lucky and get 80%, some unlucky and get 20%, and some lucky soul might guess his way to 100%. There are certain things you can and can't answer with this hypothetical study. If you have enough of "us", you might be able to answer "can some people hear...?" with some confidence, but you couldn't answer "can all people hear...?". You can't even be certain the individuals who score 95% and up could hear it, though if enough did you'd probably have a good idea. At that point, you'd want to take the high scorers and repeat the testing: some would fall out as merely lucky, others as highly likely to hear it.
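The retesting idea is easy to quantify. Assuming, hypothetically, a pass criterion of 8-of-10 per round, a pure guesser's chance of surviving several independent rounds collapses quickly, while someone who genuinely hears the difference keeps passing:

```python
from math import comb

def p_at_least(k, n):
    """Chance of k or more correct out of n by guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

p_pass = p_at_least(8, 10)  # a pure guesser "passes" one round ~5.5% of the time
for rounds in (1, 2, 3):
    print(f"P(guesser passes {rounds} round(s)) = {p_pass**rounds:.5f}")
```

After two rounds the guesser's survival odds are about 0.3%, and after three, well under 0.1%, so repeated testing separates the lucky from the able rather efficiently.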
And that's the problem I have with a lot of "can people hear the difference between 24-bit and 16-bit" kinds of tests. There is a huge difference between asking whether people in general can hear it, and whether some people can consistently hear it and prefer one over the other.