If you haven't done so, you need to read my book. I just dipped into this thread after a long absence and this popped up. The resolution of your dilemma is that listeners in these tests show more evidence of responding negatively to flaws than of responding positively to virtues. As you correctly point out, the listeners have no knowledge of how a recording "should" sound. But it turns out that most listeners have an instinct for recognizing how a recording "should not" sound - responding to characteristics of reproduced sound that are not "normal" for live sounds.
In the multiple-comparison tests that I started in 1966 at the NRCC and that have continued since at Harman (A vs B vs C vs D, not just A vs B), it is easy to recognize the timbral characters of the different loudspeakers and separate them from the common factor, the recording, whatever it is. This is why the most revealing program material exhibits wide bandwidth and a dense spectrum (complex instrumentation rather than simplicity; wide bandwidth rather than simple spectra like voice or solo instruments).
So, as I say in the book, the evidence is that listeners tell us that the highest rated loudspeaker is the least flawed, not the most virtuous, although in practice the two amount to the same thing. Decades of listener response sheets yield enormous volumes of critical comments, some quite colorful, and slim volumes of compliments, mostly versions of "sounds good". Of course subjective reviewers have added to the verbiage with terms that are often meaningless but poetic. "High resolution" loudspeakers turn out to be the ones with the fewest timbral distractions - it is not an independent variable.
Further analysis showed that the dominant flaw has been resonances, which alter the timbral signature of whatever sound is being reproduced.
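For anyone who wants to see what a resonance does to a frequency response, here is a minimal sketch. It is only an illustration, not the method used in the tests described above: it models a single resonance as a peaking biquad filter (coefficients in the widely used RBJ Audio EQ Cookbook form) and prints the gain at a few frequencies. The centre frequency, Q, peak height, and sample rate are all arbitrary values assumed for the example, and the function names are hypothetical.

```python
import cmath
import math


def peaking_biquad(f0, q, gain_db, fs=48000.0):
    """Return (b, a) coefficients of a peaking filter (RBJ cookbook form).

    f0, q and gain_db describe an assumed, illustrative resonance.
    """
    amp = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b = (1.0 + alpha * amp, -2.0 * math.cos(w0), 1.0 - alpha * amp)
    a = (1.0 + alpha / amp, -2.0 * math.cos(w0), 1.0 - alpha / amp)
    return b, a


def gain_db_at(f, b, a, fs=48000.0):
    """Magnitude of the filter's frequency response at f, in dB."""
    z1 = cmath.exp(-1j * 2.0 * math.pi * f / fs)  # z^-1 on the unit circle
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return 20.0 * math.log10(abs(num / den))


# Assumed example: a 3 dB resonance at 1 kHz with Q = 5.
b, a = peaking_biquad(f0=1000.0, q=5.0, gain_db=3.0)
for f in (100, 500, 900, 1000, 1100, 2000, 10000):
    print(f"{f:>6} Hz: {gain_db_at(f, b, a):+.2f} dB")
```

In this model the response is lifted only in a narrow band around the resonance, so the coloration is imposed on a sound only if that sound has energy in the band.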
The Harman listener training (which you can download and experience yourself) has ONLY to do with recognizing and describing resonances so that useful information can be fed back to the designers, helping them to find and fix audible problems. So, if there is a bias introduced by such training, it is that those trained listeners are very adept at hearing and describing loudspeakers that are not timbrally neutral. Is this a problem? I think not.