I still have a lot of catching up to do on this fast-moving thread, but I have to point out that that's not what statistical significance means.
Clearly, if you got 10 out of 10, that still isn't 100% certainty that you weren't guessing. If everyone on this board, each morning, took out a coin and called "heads", then flipped it 10 times, at some point we'd have someone with all 10 tosses "heads". Maybe even on the first day. It doesn't mean they could predict the coin; they almost certainly couldn't. But if we polled everyone's results each day, we'd expect a binomial distribution: roughly bell-shaped (approximately Gaussian) and centered on 5 heads.
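To put rough numbers on the coin example, here is a small Python sketch. The board size of 1000 flippers is purely an assumption for illustration, and the function names are made up for this post.

```python
from math import comb

def heads_pmf(n_flips=10):
    """Probability of each possible head count (0..n_flips) for a fair coin."""
    return [comb(n_flips, k) / 2**n_flips for k in range(n_flips + 1)]

def chance_someone_gets_all_heads(n_people, n_flips=10):
    """Probability that at least one of n_people flips all heads on a given day."""
    p_single = 0.5 ** n_flips  # 1/1024 for 10 flips
    return 1 - (1 - p_single) ** n_people

# Hypothetical board size, chosen only for illustration.
BOARD_SIZE = 1000

pmf = heads_pmf()
p_any = chance_someone_gets_all_heads(BOARD_SIZE)

for k, p in enumerate(pmf):
    print(f"{k:2d} heads: {p:.4f}")
print(f"Chance at least one of {BOARD_SIZE} people gets 10/10 today: {p_any:.2f}")
```

With that assumed board size, a perfect 10/10 turns up somewhere on the board on more days than not, even though any one person has only a 1 in 1024 chance.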
I find it's common to misunderstand what confidence means. A friend reported getting 20 of 30 trials correct in a listening test, and said that was very near the "golden" 95% confidence (he also said he didn't want to test any more and ruin his score). In this case, it was a test of hearing 24-bit dithered versus truncated audio. More recently, I gave him a tone test at receding levels, and he found it difficult to hear anything below -90 dBFS, which casts further doubt on his (or anyone's, of course) ability to hear ~-140 dB in the presence of normal music levels.
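For reference, the usual way to score a result like 20 of 30 is a one-sided binomial tail: the chance of doing at least that well by guessing alone. A minimal sketch, assuming a 50/50 guess on each trial and a trial count fixed in advance (the function name is my own):

```python
from math import comb

def p_value_at_least(correct, trials, p_guess=0.5):
    """Chance of scoring `correct` or more out of `trials` by guessing alone."""
    return sum(
        comb(trials, k) * p_guess**k * (1 - p_guess) ** (trials - k)
        for k in range(correct, trials + 1)
    )

p_20_of_30 = p_value_at_least(20, 30)
print(f"P(>= 20 of 30 by guessing) = {p_20_of_30:.4f}")
```

The number only means what it says if the trial count really was fixed beforehand. Stopping when the score happens to look good makes the result look stronger than it is. And either way, it is the probability of the score given guessing, not the probability that he can hear the difference.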
So, yes, 9 out of 10 is very attention-getting, and I agree that "the 9/10 result shouldn't be easily dismissed as a simple fluke". The problem is, it's not conclusive either.
Which brings up a second problem with these kinds of tests. If we're giving such a test to a large group to find out what the general ability of a population is, then it makes sense to give everyone 10 tries. But if we're trying to find out whether you can really pick the right choice, we need more than 10.
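One way to see why 10 tries is thin for a single listener: work out how often someone who really does hear a difference would clear the bar. This is only a sketch. The 70% true hit rate and the 0.05 threshold are assumptions picked for illustration, not figures from any actual test.

```python
from math import comb

def tail(k_min, n, p):
    """P(at least k_min correct out of n) when each trial succeeds with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

def critical_score(n, alpha=0.05):
    """Smallest score whose guessing-only tail probability is <= alpha, or None."""
    for k in range(n + 1):
        if tail(k, n, 0.5) <= alpha:
            return k
    return None  # too few trials to ever reach alpha

def power(n, p_true, alpha=0.05):
    """Chance that a listener with hit rate p_true reaches the critical score."""
    k = critical_score(n, alpha)
    return 0.0 if k is None else tail(k, n, p_true)

# Hypothetical listener who is right 70% of the time.
P_TRUE = 0.7

for n in (10, 30, 100):
    print(f"n={n:3d}: need >= {critical_score(n)} correct, "
          f"power = {power(n, P_TRUE):.2f}")
```

Under those assumptions, a listener who is right 70% of the time would hit the 9-of-10 bar only about 15% of the time, so a short test misses most real but modest abilities. Longer runs fix that.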