No speakers are used in this test. Just high-end planar headphones.
When the only thing that changes in a test is the amplifier, the load is exactly the same (in this case purely resistive), the headphone isn't even moved, both amplifiers are set to exactly the same volume, the listener is the same, and the comparison is spread over time (A/B switching several times), then the only variable in that specific test is the amplifier.
It really does not matter how 'wonky' the FR is because it will be equally wonky on both amps during evaluation.
So when he 'evaluates' bass, mids, treble, cleanness, bass impact, air, imaging, even PRaT, and consistently rates them differently for the two amps, that is basically the same as comparing A to B at the moment of switching. The headphone is out of the equation.
I would get your point if he were describing the sonic signatures of both amps and those descriptions were then compared with other people's, made using different headphones. That's not the test here, so it's irrelevant.
@BrEpBrEpBrEpBrEp's suggestion, comparing notes on perceived SQ taken blind (not knowing which amp is playing) and repeated often enough to be statistically meaningful, is accurate enough.
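
To put a rough number on 'often enough', here is a minimal sketch. It assumes each blind trial is a simple right/wrong identification of the amp and that pure guessing is right half the time. The function names and the 5% threshold are my own choices for illustration, not part of anyone's actual test protocol.

```python
from math import comb


def p_value_by_chance(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` right out of `trials` by pure guessing.

    Assumes every trial is an independent 50/50 guess (one-sided binomial test).
    """
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


def min_correct_needed(trials: int, alpha: float = 0.05) -> int:
    """Smallest number of correct answers for which guessing has probability <= alpha.

    Returns trials + 1 when even a perfect score cannot reach the threshold,
    i.e. the test has too few trials to show anything.
    """
    for correct in range(trials + 1):
        if p_value_by_chance(correct, trials) <= alpha:
            return correct
    return trials + 1


if __name__ == "__main__":
    for n in (10, 16, 20):
        print(n, "trials ->", min_correct_needed(n), "correct needed")
```

With 16 trials this comes out at 12, so he would need to pick the right amp at least 12 times out of 16 before guessing becomes an unlikely explanation. With only a handful of trials even a perfect score doesn't say much.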