So blind testing is like a one-way door. Positive results tell you something definitive, within a confidence range that you can control and measure. But negative results tell you nothing; more precisely, only that whatever difference you were testing for was not obvious enough to be detected. It doesn't mean the difference doesn't exist, nor does it mean you can't hear it. A negative result simply doesn't tell you one way or the other.
I was with you till you said it tells you nothing.
The issues you state are real, but we have tools to deal with them. The best description is in the international standard, ITU-R BS.1116,
Methods for the subjective assessment of small impairments in audio systems
The document is quite approachable, so I suggest reading it. For now, the tools we use to increase detection are:
1. Pre-introduction to the test. Listeners can, as a group, evaluate the content, test setup, etc., sighted or blind, to become familiar with the content, the methods, and the potential audible differences. They are free to do this as much as they like.
2. Controls. Positive controls are included to a) make sure the test is valid and b) weed out poor listeners. For example, in a test of the transparency of 320 kbps MP3 versus CD, we include a 64 kbps MP3. The latter has narrower frequency response and far more artifacts. Anyone who misses it is excluded from the test.
3. Listener training. It is appropriate, and indeed recommended, that trained listeners be used where possible. Training involves hearing the most extreme version of the distortion under test and then gradually lowering the level of impairment. After practice, such listeners can reliably hear differences that escape even the most ardent audiophiles.
4. Keeping the test short. Trained listeners can be run through the test in advance to see whether its length is tolerable before fatigue sets in.
5. Memory aids. Blind testing tools need controls that let small sections of music be looped, with instantaneous switching between inputs. Short-term auditory memory is very accurate, and this technique puts it in control rather than sloppy long-term memory. In my testing of very small impairments, I will at times identify a note as short as half a second and loop it to find differences.
Note that no such control exists in the typical evaluations audiophiles perform. They rely on very long-term memory (e.g., listening to one song and then repeating it), which completely destroys any chance they have of telling small differences apart.
6. Statistical analysis to catch people who are randomly guessing. Their results can be excluded from the final tally if needed.
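As a sketch of what that statistical check looks like: the standard approach for an ABX test is a one-sided binomial test against chance (50% for a two-choice trial). The function below is illustrative, not from any particular test suite, and uses only the Python standard library.

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided p-value: the probability of getting at least
    `correct` answers right out of `trials` ABX trials by pure
    guessing (chance = 0.5 per trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# A listener scoring 12/16 is unlikely to be guessing (p ~ 0.038),
# while 9/16 is entirely consistent with chance (p ~ 0.40).
print(abx_p_value(12, 16))
print(abx_p_value(9, 16))
```

A listener whose p-value stays high across sessions is statistically indistinguishable from a guesser, which is the basis for excluding their data.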
Following this system, as the international organizations that develop audio technology do, the level of acuity achieved (i.e., the false-negative rate, as you call it) is far, far better than that of the mass public or mass audiophiles. See this published test on the topic:
https://www.audiosciencereview.com/...ity-and-reliability-of-abx-blind-testing.186/
So while it is true that no test is perfect, and it is abundantly easy to get negative (or false positive) results, we do know how to conduct such tests properly so that the results are quite defensible. Once we combine multiple such tests, we build up confidence in the results.
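One hedged sketch of how "combining multiple tests" can be made quantitative: Fisher's method merges independent p-values into a single combined p-value. This is a standard statistical technique, not something prescribed by BS.1116; the helper below is an illustration using only the standard library (the chi-square survival function has a closed form when the degrees of freedom are even, as they are here).

```python
from math import exp, log, factorial

def fisher_combined_p(p_values):
    """Fisher's method for combining independent one-sided p-values.
    X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom, where k = number of tests. Since 2k is
    even, the survival function reduces to a finite sum."""
    k = len(p_values)
    half = -sum(log(p) for p in p_values)  # x/2 where x = -2*sum(ln p)
    return exp(-half) * sum(half ** i / factorial(i) for i in range(k))

# Two independent tests that each came in at p = 0.10 combine
# to roughly p ~ 0.056: individually weak, jointly more telling.
print(fisher_combined_p([0.10, 0.10]))
```

The design point: several individually inconclusive tests, run at different labs or sessions, can together yield a defensible overall conclusion, which is exactly the "build up confidence" step described above.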