You could say I went off the boil towards the end... or I was guessing well at the beginning
There is simply a problem regarding statistical power and the corresponding sample size, because both depend on the detection ability of the listener (under the specific test conditions) and on the effect size.
Foobar lets you choose your own decision criterion, but doesn't supply any information about these dependencies.
Let's say you choose the traditional 0.05 criterion; then you'll take any result with a probability of p < 0.05 (meaning the probability of reaching that result, or a better one, by pure random guessing) as evidence that you could hear a difference. So the risk of accepting a result as evidence although it was only due to random guessing is 5% (in the long run).
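To make that criterion concrete, here is a minimal sketch of how the guessing probability for an ABX result can be computed. The function name `abx_p_value` is just my own label for illustration, not anything Foobar exposes, and it assumes a one-sided test with a guessing probability of 0.5 per trial.

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """Probability of getting at least `correct` answers right out of
    `trials` by pure guessing (each trial a 50/50 coin flip)."""
    # Count all outcomes with `correct` or more hits, divide by all outcomes.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

# Illustrative 16-trial run (hypothetical scores):
print(abx_p_value(12, 16))  # about 0.038, below the 0.05 criterion
print(abx_p_value(11, 16))  # about 0.105, not below it
```

So with 16 trials and the 0.05 criterion you would need at least 12 correct answers before the result counts as evidence.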
Statistical power describes the ability of your test to detect a difference if there is a detectable difference (more precisely, to correctly reject the null hypothesis when it is false).
So if you want to balance both error risks you should aim for a statistical power of 0.95 (that is, a false-negative risk beta of 5%), and then the effect size/detection ability determines the sample size needed to reach those numbers for both error risks.
Let's say for example that the effect size is 0.2, which means your actual p is 0.7 instead of the p = 0.5 assumed under the null hypothesis; then you'd need a sample size of 67 trials.
If you aim for the minimum statistical power that is commonly used nowadays, (1-beta) = 0.8, you'd still need at least 37 trials.
If the effect size or your detection ability is only at p = 0.6, you'd need as many as 158 trials (same conditions, power = 0.8, alpha risk = 0.05). All numbers so far are calculated for a one-sided test.
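For anyone who wants to explore these sample sizes themselves, the following is a rough sketch of a brute-force search using the exact binomial distribution. All function names are my own, and it assumes a one-sided test against p = 0.5. Because exact binomial power goes up in a saw-tooth pattern rather than smoothly, the "smallest n" found this way can differ by a few trials from the numbers a dedicated power calculator reports, so treat the output as a ballpark check rather than a reproduction of the figures above.

```python
from math import comb

def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n: int, alpha: float = 0.05) -> int:
    """Smallest number of correct answers whose guessing probability is <= alpha."""
    for k in range(n + 2):  # k = n + 1 gives an empty tail (probability 0)
        if binom_tail(n, k, 0.5) <= alpha:
            return k

def power(n: int, p_true: float, alpha: float = 0.05) -> float:
    """Chance of reaching the critical count when the listener's real hit rate is p_true."""
    return binom_tail(n, critical_count(n, alpha), p_true)

def min_trials(p_true: float, target_power: float,
               alpha: float = 0.05, max_n: int = 1000) -> int:
    """First n (searching upward) whose exact power reaches target_power."""
    for n in range(1, max_n + 1):
        if power(n, p_true, alpha) >= target_power:
            return n
    raise ValueError("target power not reached within max_n trials")

n = min_trials(p_true=0.7, target_power=0.8)
print(n, round(power(n, 0.7), 3))
```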
If we calculate the achieved statistical power for the example above (assumed actual p = 0.7, 16 trials, alpha = 0.05) we get the result:
power = (1-beta) = 0.445, so the probability of getting a false negative is 55.5%
If we calculate the same for an assumed actual p = 0.6, we get:
power = (1-beta) = 0.167, so the probability of getting a false negative is even higher at ~83%
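The achieved power for a 16-trial test can be checked with a few lines of code. This is only a sketch under my own assumptions (exact binomial, one-sided test against p = 0.5, function names invented for illustration); an exact calculation lands close to the figures quoted, and small differences may come from the tool or approximation used.

```python
from math import comb

def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n: int, alpha: float = 0.05) -> int:
    """Smallest number of correct answers whose guessing probability is <= alpha."""
    for k in range(n + 2):  # k = n + 1 gives an empty tail (probability 0)
        if binom_tail(n, k, 0.5) <= alpha:
            return k

def achieved_power(n: int, p_true: float, alpha: float = 0.05) -> float:
    """Chance of passing the test when the real hit rate is p_true."""
    return binom_tail(n, critical_count(n, alpha), p_true)

for p_true in (0.7, 0.6):
    pw = achieved_power(16, p_true)
    print(f"p = {p_true}: power = {pw:.3f}, false-negative risk = {1 - pw:.3f}")
# p = 0.7 gives a power of roughly 0.45, p = 0.6 roughly 0.17
```

In other words, even a listener who really gets 70% of trials right will fail a 16-trial test more often than pass it.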