Statistically speaking, these things get tricky quickly. These are very small sample sizes, and what you describe is a perfectly plausible outcome of pure chance across multiple tests. People get carried away with ABX tests and, yes, the principle is valid. However, most ABX tests are run on samples that are far too small. It is a better, more grounded approach than going fully subjective, yes, but it is still problematic.
This isn't specific to audio, btw; it is a recurring problem.
This paper
https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124 "Why Most Published Research Findings Are False"
made a big splash when it was published and has been cited a ton of times.
Note 1: this isn't meant as a criticism of your method or results. I would do the same because I don't have the patience or the focus to collect enough data.
Note 2: if the effect is strong (in audio, say, a clear distortion that is heard and recognized with high accuracy), small-n results may be valid. Otherwise, you really need a relatively large number of trials for every specific point you want to test (more than about 30, a ballpark figure I've kept in mind from my stats teachers), unless you want to go with fancier small-sample distributions.
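To put rough numbers on Note 2, here is a minimal sketch using an exact one-sided binomial test, which is one common way to score ABX runs (not the only one). The function names, the 0.05 threshold, and the 70% "true hit rate" in the usage note are my own assumptions for illustration, not anything from your test.

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more correct answers."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_correct(n, alpha=0.05):
    """Smallest score out of n whose p-value under pure guessing is <= alpha."""
    for k in range(n + 1):
        if binom_tail(k, n) <= alpha:
            return k
    return None  # n is too small to ever reach alpha

def power(n, p_true, alpha=0.05):
    """Chance that a listener with true hit rate p_true passes an n-trial test."""
    k = min_correct(n, alpha)
    return 0.0 if k is None else binom_tail(k, n, p_true)

if __name__ == "__main__":
    for n in (10, 16, 40):
        print(n, min_correct(n), round(power(n, 0.7), 3))
```

With 10 trials you need 9 correct, since 8 of 10 still has a guessing p-value of about 0.055; with 16 trials you need 12. The `power` column shows how often a listener who really does hear the difference 70% of the time would pass: it rises with n, which is the whole point about strong versus subtle effects.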
Note 3: running multiple distinct tests (again, unless you have identified a very specific characteristic and you test for that characteristic in all of them) increases the chance of finding a significant result purely by chance, but it will be just _any_ significant result, not necessarily the one you were looking for... Drug companies love those.
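The arithmetic behind Note 3, as a rough sketch: if you run m tests where nothing is really going on, each at significance level alpha, and assume the tests are independent (a simplification), the chance of at least one false positive is 1 - (1 - alpha)^m. The function names are mine, and the Bonferroni correction shown is just one standard, conservative fix.

```python
def familywise_error(m, alpha=0.05):
    """Chance of at least one false positive across m independent null tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(m, alpha=0.05):
    """Per-test threshold that keeps the overall false-positive rate <= alpha."""
    return alpha / m

if __name__ == "__main__":
    for m in (1, 5, 10, 20):
        print(m, round(familywise_error(m), 3), bonferroni_alpha(m))
```

At 20 independent tests the chance of at least one spurious "hit" is already about 64%, which is why a single significant result fished out of many comparisons doesn't tell you much.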