How does that invalidate the listener's opinion?
Because it is trivial to have an opinion about anything that plays music. It is no different from you saying it will rain here tomorrow. That you have an opinion is not of any value; what has value is whether you are right the vast majority of the time. Otherwise it is a guess, and who cares about a guess?
A test like this needs to be administered by a third party, not scored by the listener himself. You play A and B and see which one the tester says sounds better. You repeat this a dozen times with the sequence randomized. Then you perform a statistical analysis to see whether the outcome is random or has a better than 95% chance of being genuine. This is why every controlled test that is published includes a ton of statistical analysis. If that is missing, you need to run, run far away from anything the person says.
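The statistical analysis here is just a one-sided binomial test: how likely is it that a pure guesser (50/50 on each trial) scores at least this well? A minimal sketch (the 16-trial run length and the 5% cutoff are illustrative choices, not a fixed standard):

```python
from math import comb

def p_guessing(correct: int, trials: int) -> float:
    """One-sided binomial test: probability of getting at least
    `correct` out of `trials` right by pure guessing (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# For a 16-trial ABX run, how many correct answers are needed
# before chance can be rejected at the 5% level?
threshold = next(k for k in range(17) if p_guessing(k, 16) < 0.05)
print(threshold)  # 12
```

Note that 12/16 is well above the 8/16 a coin flip averages; scoring "better than half" is nowhere near enough to claim you heard a difference.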
Here is me passing an MP3 ABX test:
foo_abx 1.3.4 report
foobar2000 v1.3.2
2014/07/19 19:45:33
File A: C:\Users\Amir\Music\Arnys Filter Test\keys jangling 16 44.wav
File B: C:\Users\Amir\Music\Arnys Filter Test\keys jangling 16 44_01.mp3
19:45:33 : Test started.
19:46:21 : 01/01 50.0%
19:46:35 : 02/02 25.0%
19:46:49 : 02/03 50.0%
19:47:03 : 03/04 31.3%
19:47:13 : 04/05 18.8%
19:47:27 : 05/06 10.9%
19:47:38 : 06/07 6.3%
19:47:46 : 07/08 3.5%
19:48:01 : 08/09 2.0%
19:48:19 : 09/10 1.1%
19:48:31 : 10/11 0.6%
19:48:45 : 11/12 0.3%
19:48:58 : 12/13 0.2%
19:49:11 : 13/14 0.1%
19:49:28 : 14/15 0.0%
19:49:52 : 15/16 0.0%
19:49:56 : Test finished.
----------
Total: 15/16 (0.0%)
See the statistical analysis showing a 0.0% probability of being wrong? That tells you my outcome has a very high chance of being reliable.
Here is a counterexample: me testing whether I could reliably detect a "grounding box" being attached to the system:
foo_abx 1.3.4 report
foobar2000 v1.3.2
2016/02/14 08:50:25
File A: C:\Users\Amir\Documents\Test Music\Entreq 2 digital\test_4_output_entreq.wav
File B: C:\Users\Amir\Documents\Test Music\Entreq 2 digital\test_4_output_no_entreq.wav
08:50:25 : Test started.
08:52:22 : 01/01 50.0%
08:52:30 : 01/02 75.0%
08:52:43 : 02/03 50.0%
08:52:51 : 02/04 68.8%
08:53:03 : 02/05 81.3%
08:53:32 : 02/06 89.1%
08:53:58 : 03/07 77.3%
08:54:12 : 03/08 85.5%
08:54:27 : 03/09 91.0%
08:54:31 : Test finished.
----------
Total: 3/9 (91.0%)
I got 3 answers right, so you may think I actually "heard" what the device did. A quick statistical analysis shows that, having missed the 6 other trials, there is a 91% chance that I was guessing. In this instance, your conclusion should be that I did not detect this box being there, not that "I have an opinion that should be listened to."
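You can reproduce the percentages in the logs above with the same one-sided binomial test (a sketch; the exact rounding foo_abx applies is an assumption on my part):

```python
from math import comb

def p_guessing(correct, trials):
    # Chance of getting at least `correct` of `trials` right by guessing
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

print(round(p_guessing(3, 9) * 100, 1))    # 91.0 -- the grounding-box run
print(round(p_guessing(15, 16) * 100, 1))  # 0.0  -- the MP3 run
```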
This type of test should have had two phases: in phase 1, you establish whether any difference is reliably detected at all, per the above. Only once that is established do you run a second test for preference.
There is just nothing here that is remotely reliable. Audiophiles routinely think A sounds better than B even when we can prove there is no difference in the sound waves coming out of the devices. The fact that he didn't know A and B's identities is not important in this context.