Uh-huh, but if it's so close to the threshold of perception that it takes trial after trial to identify in a blind test, how could it possibly be a notable difference in the sort of comparisons the average subjective review undertakes? Like... how could something so subtle be reliably identified in a comparison between this new DAC I just bought and the one I had before, which I last listened to a minute ago, or a couple of days or weeks ago?
For audio reviewers who claim the differences are obvious, your point is valid. But I'm talking about a different case. Sometimes when I listen for differences, they are so subtle that I'm not sure they're actually there. I want to know whether they're real or just my imagination. Also, companies building audio gear and engineers designing codecs need to test near the limits of perception. That requires a lot of trials, so I would imagine they use techniques to mitigate listener fatigue, or have a statistically valid way to aggregate shorter tests done on different days. Either would be interesting to share here.
PS: for example, consider the following set of tests, each conducted on different days:
Test 1: 7 trials, 5 correct, 77.34% confidence
Test 2: 8 trials, 5 correct, 63.67% confidence
Test 3: 6 trials, 4 correct, 65.63% confidence
Test 4: 9 trials, 6 correct, 74.61% confidence
None of them reached 95% confidence. Can we simply pool them? If so, that's 30 trials with 20 correct, which is 95.06% confidence.
Intuitively, if you do only slightly better than random guessing on a short test, it might just be luck. But if you do only slightly better than random guessing every time, consistently, repeatedly, then you can still reach high confidence with enough trials.
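For what it's worth, the percentages above are one-sided exact binomial probabilities: "confidence" is one minus the chance of scoring at least that many correct by pure guessing (p = 0.5). A short script can check both the individual tests and the pooled total (a minimal sketch; the `confidence` helper name is mine, and pooling like this assumes every trial is independent with the same guessing null, with the trial counts fixed in advance rather than chosen by peeking at the results):

```python
from math import comb

def confidence(trials: int, correct: int) -> float:
    """1 - P(at least `correct` successes in `trials` fair coin flips)."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return 1 - tail / 2 ** trials

# The four sessions from the post: (trials, correct)
tests = [(7, 5), (8, 5), (6, 4), (9, 6)]
for n, k in tests:
    print(f"{n} trials, {k} correct: {confidence(n, k):.4f}")

# Pool all sessions into one test
n_total = sum(n for n, _ in tests)  # 30
k_total = sum(k for _, k in tests)  # 20
print(f"{n_total} trials, {k_total} correct: {confidence(n_total, k_total):.4f}")
```

Each individual session stays well under 0.95, but the pooled 20/30 comes out just above it, matching the numbers quoted above.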