Your point is valid for audio reviewers who claim the differences are obvious, but I'm talking about a different case. Sometimes when I listen for differences, they are so subtle that I'm not sure whether they are actually there. I want to know whether they are real or just my imagination. Also, companies building audio gear and engineers designing codecs need to test near the limits of perception. That requires a lot of trials, so I would imagine they either use techniques to mitigate listener fatigue or have a mathematically valid way to aggregate shorter sessions done on different days. Either would be interesting to share here.
PS: for example, consider the following set of tests, each conducted on a different day:
Test 1: 7 trials, 5 correct, 77.34% confidence
Test 2: 8 trials, 5 correct, 63.67% confidence
Test 3: 6 trials, 4 correct, 65.63% confidence
Test 4: 9 trials, 6 correct, 74.61% confidence
None reached 95% confidence. Can we simply sum them? If so, that's 30 trials with 20 correct, which works out to 95.06% confidence.
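As a sanity check on those figures, here is a minimal sketch (Python, assuming scipy is available) that reproduces each confidence value as a one-sided binomial test against chance, i.e. 1 minus the probability of scoring at least that many correct by pure guessing:

```python
# Reproduce the per-session and pooled confidence figures above.
# "Confidence" = 1 - P(at least k correct out of n by guessing, p = 0.5).
from scipy.stats import binomtest

sessions = [(7, 5), (8, 5), (6, 4), (9, 6)]  # (trials, correct) per day

for n, k in sessions:
    p = binomtest(k, n, p=0.5, alternative="greater").pvalue
    print(f"{n} trials, {k} correct: {100 * (1 - p):.2f}% confidence")

# Pool all sessions into a single binomial test:
n_total = sum(n for n, _ in sessions)  # 30
k_total = sum(k for _, k in sessions)  # 20
p = binomtest(k_total, n_total, p=0.5, alternative="greater").pvalue
print(f"Pooled: {n_total} trials, {k_total} correct: {100 * (1 - p):.2f}% confidence")
```

One caveat I'd flag: pooling like this is clean only if the total number of trials is fixed in advance. If you keep adding sessions until the pooled number happens to cross 95%, the real false-positive rate is higher than 5%.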
Intuitively, if you do only slightly better than random guessing on one short test, it might just be luck. But if you do slightly better than random guessing consistently, session after session, you can still reach high confidence with enough total trials.
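To make that intuition concrete, here is a small sketch (same assumptions as above) for a hypothetical listener who is correct exactly 60% of the time, only slightly above the 50% chance rate, at increasing trial counts:

```python
# Confidence grows with trial count even when per-trial accuracy is
# fixed only slightly above chance (hypothetical 60% correct rate).
from scipy.stats import binomtest

for n in (10, 30, 100, 300, 1000):
    k = round(0.6 * n)  # exactly 60% correct at every size
    p = binomtest(k, n, p=0.5, alternative="greater").pvalue
    print(f"{n:5d} trials, {k:4d} correct: {100 * (1 - p):6.2f}% confidence")
```

The listener never gets any better; the evidence just accumulates. At 10 trials, 6 correct is barely above a coin flip (about 62% confidence), while at 100 trials, 60 correct is already past 97%.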