And it showed how even the trained Harman testers were strongly negatively biased!
Not that strongly actually. Here are the results:
Note the exaggerated vertical scale, which starts at 5 and only goes to 8. That magnifies the differences beyond what they would normally appear to be.
Notice that speaker T did not change its scores at all. Speakers G and D kept similar ranks blind and sighted, and both were better than T either way.
The only one that changed roles was speaker S, and that was a small change: from 5.8 to 6.4 or thereabouts. Once you take error ranges into account, that difference may not even be statistically valid.
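To see why such a small difference can fall inside the error range, here is a minimal sketch. The two means (5.8 and 6.4) are the approximate scores mentioned above; the standard deviation and listener count are assumed for illustration only, not taken from the paper:

```python
# Do two mean preference ratings differ once error ranges are considered?
# Means are from the discussion above; sd and n are ASSUMED for illustration.
import math

def ci95(mean, sd, n):
    """95% confidence interval for a mean (normal approximation)."""
    half = 1.96 * sd / math.sqrt(n)
    return (mean - half, mean + half)

blind = ci95(5.8, sd=1.5, n=20)    # assumed sd and listener count
sighted = ci95(6.4, sd=1.5, n=20)  # assumed sd and listener count

# If the intervals overlap, the 0.6-point gap is not clearly real.
overlap = blind[1] >= sighted[0]
print(blind, sighted, overlap)
```

With these assumed numbers each interval is about ±0.66 wide, so the two intervals overlap heavily and the 0.6-point gap cannot be called significant. With many more listeners, or a smaller spread in scores, the same gap could become significant.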
You are also mistaken about "trained" listeners being involved in the test. They were not; "experienced" listeners were used:
What is the definition of experienced? The paper defines the reverse:
Taking listening tests does not make you trained. It just means you are familiar with the protocol and perhaps better at the task than the general population.
I am confident professionally trained listeners would have produced far more consistent results than the "experienced" group did. I know our trained listening panel at Microsoft did. The vast majority of the time, what our teams found in sighted listening were problems that were then confirmed objectively to be there and fixed. Blind tests were only the final confirmation.
Mind you, a blind test is better if you can run one, but as we have said, that is not possible with headphones. And the DT990 Pro results clearly indicate that emulation using another headphone can produce very faulty results, since this headphone is not remotely as good as was assumed. In this case, there is no doubt whatsoever that a sighted test of this headphone would have produced more reliable results than emulation using a virtual headphone. The aberrations in this headphone are so much in your face that you would lose your "trained" badge if you scored it as anything other than bottom of the pile.