Another interesting aspect was that experienced listeners were almost completely unable to differentiate between Headphone A equalized to sound like Headphone B, versus just Headphone B (with the rare exception of cases where either headphone exhibited a significant amount of distortion). This implies that cup reflections and design really might not play as big a role as we like to think, and audiophiles might not be as good at distinguishing non-FR aspects as they like to think.
I think that this might not be a 100% accurate representation of the related papers, which I assume are:
https://www.aes.org/e-lib/browse.cfm?elib=16874
and:
https://www.aes.org/e-lib/browse.cfm?elib=18462
and possibly:
https://www.aes.org/e-lib/browse.cfm?elib=17441
From what I understand so far, in the first article (over-ears), the virtual headphones were tested against the real headphones directly on the subjects' heads.
But in the second paper (in-ears), the virtual headphones were tested against recordings of the real headphones made on the GRAS fixture. So in the second case the question of HPTF variability or measurement inaccuracies is lost, and the two headphones (real vs. virtualised) were guaranteed to have nearly the same FR (which is definitely what you want in the third paper, where you need to make sure that FR is a controlled variable). This is possibly why a better correlation was found for the in-ear virtualisation than for the over-ears.
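To make concrete what "virtualising" a headphone via EQ roughly amounts to, here's a minimal sketch (Python, with made-up file names, and not claiming to be the papers' actual procedure): measure both headphones on the same fixture, derive a correction filter from the ratio of their responses, and play the programme through the physical headphone with that correction applied. The whole debate above is whether the fixture measurement is close enough to what actually happens at each individual's eardrum.

```python
import numpy as np
from scipy.signal import fftconvolve

n_fft = 4096

# Hypothetical measured impulse responses (e.g. from sweeps on the same fixture)
target_ir = np.loadtxt("hp_B_ir.txt")     # the headphone we want to imitate ("B")
playback_ir = np.loadtxt("hp_A_ir.txt")   # the headphone the listener actually wears ("A")

T = np.fft.rfft(target_ir, n_fft)
P = np.fft.rfft(playback_ir, n_fft)

# Magnitude-only correction, lightly regularised so deep nulls aren't boosted to infinity
eps = 1e-3 * np.max(np.abs(P))
correction_mag = np.abs(T) / np.maximum(np.abs(P), eps)

# Turn it into a causal linear-phase FIR filter
correction_ir = np.fft.irfft(correction_mag, n_fft)
correction_ir = np.roll(correction_ir, n_fft // 2) * np.hanning(n_fft)

def virtualize(program):
    # programme played through headphone A with this EQ should measure like headphone B
    return fftconvolve(program, correction_ir)[: len(program)]
```

The correction is only "exact" at the point where the two measurements were taken; on a different head (or with a different fit) the two headphones can drift apart again, which is exactly the HPTF question.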
In the over-ears virtualisation article the preference order was actually different between the real and virtualised HPs. For one HP (HP1) the article attributes the difference to sensory biases, and for another one to leakage effects (which are an important part of HPTF variability that Harman evaluated in another article; if that's the explanation, then it's already well established that design matters in a big f*cking way). No explanation was given for HP5's significantly different score, I believe. But I'm not certain that all nuisance variables were explored (HP1's perceived spectral balance was a lot better in the real test than in the virtualised one, for example).
In all cases, the question of HPTF variability or measurement inaccuracies is not enough to invalidate the idea that Harman's targets tend to be generally preferred over other targets, far from it. In the 2013 article (https://www.aes.org/tmpFiles/elib/20210917/16768.pdf) the "RR_G" target was tested against "DF_M" and "DF_MH" (two different diffuse field targets) and "FF" (free field) on both the LCD-2 and HD518 (on the GRAS rig these targets should measure identically on both headphones), and the order of preference remained similar.

But interestingly, the delta between the targets' scores was quite different: in the LCD-2 test the RR_G target was only slightly preferred over DF_M, with the latter strongly preferred over DF_MH, while in the HD518 test RR_G was strongly preferred over DF_M, with the latter only slightly preferred over DF_MH, even though on a GRAS rig they all measured the same and therefore should logically have scored consistently. I'm not certain that this discrepancy can solely be attributed to HPTF variations / measurement inaccuracies pushing the targets above or below the GRAS response when measured at the listeners' DRP, as the two tests included additional, different targets that may have biased the way the subjects comparatively rated the constant targets (the addition of "RR1_G" in the LCD-2 test, "DF_L" in the HD518 test, and the "no EQ" condition, obviously different for each HP, in both cases). But it could be.
My understanding so far is that the question of HPTF variability / measurement inaccuracies may introduce a grey area where the predictive value of Harman's research is less reliable. It's not a big deal when, even as of 2021, HP manufacturers still sell crap like this:
https://headphonetestlab.co.uk/test-results-manufacturers-a-d-bw-px7
But it might make it difficult to make preference predictions for two HPs that already measure similarly or score decently well (example: HD560S vs. HD650).
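As a rough illustration of what a "preference prediction" involves (a toy calculation, not Harman's published model; the coefficients and numbers below are placeholders): the usual approach is to compute statistics of the error curve relative to the target and feed them into a linear model.

```python
import numpy as np

freqs = np.geomspace(50, 10000, 200)      # analysis band, Hz
target_db = np.zeros_like(freqs)          # target curve in dB (placeholder)

def error_stats(measured_db, target_db, freqs):
    err = measured_db - target_db
    sd = np.std(err)                                   # std dev of the error curve (dB)
    slope = np.polyfit(np.log10(freqs), err, 1)[0]     # tilt of the error curve (dB/decade)
    return sd, abs(slope)

def toy_score(sd, abs_slope, a=100.0, b=10.0, c=10.0):
    # toy linear model (made-up coefficients): higher = "preferred"
    return a - b * sd - c * abs_slope

# Two hypothetical headphones whose deviations from target differ only slightly
rng = np.random.default_rng(0)
hp1_db = target_db + rng.normal(0, 1.0, freqs.shape)
hp2_db = target_db + rng.normal(0, 1.2, freqs.shape)

for name, measured in [("HP1", hp1_db), ("HP2", hp2_db)]:
    sd, sl = error_stats(measured, target_db, freqs)
    print(name, round(toy_score(sd, sl), 1))
```

If the actual on-head responses deviate from the fixture measurement by a couple of dB in places, that alone can shift these statistics by about as much as the gap between two decent headphones, which is the grey area I mean.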
The extent of that grey area may not be that easy to determine. I don't know of many articles that explored the question of HPTF variability / measurement inaccuracies, and it's not easy to measure the actual on-head response of a pair of HPs past 1kHz or so. I believe these explore the issue:
https://www.aes.org/e-lib/browse.cfm?elib=17699
https://www.aes.org/e-lib/browse.cfm?elib=17242
https://www.aes.org/e-lib/browse.cfm?elib=16877
At lower frequencies, for some HP types, it can be very significant. That's been extensively studied and is one of Rtings' most interesting sets of measurements IMO:
https://www.rtings.com/headphones/tests/sound-quality/frequency-response-consistency
Between, let's say, 400Hz and 1kHz, it's probably quite minimal.
Above 5kHz it's difficult to know for sure, but probably quite significant.
And I'm starting to think that we don't have the last word yet for the ear canal gain region (particularly, I suspect, for ANC over-ears with a robust ANC / feedback mechanism such as the Bose 700 or AirPods Max, which are already quite difficult to measure accurately for various reasons).
My understanding is that some of these variations may be desirable if they correspond to what your individual HRTFs would produce if you took the place of the GRAS mannequin in Harman's "decent room with decent speakers", but some aren't. Again, it's difficult to know the extent of that.
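If one did have both a fixture measurement and a trustworthy estimate of the on-head response (e.g. from an in-ear / blocked-canal mic), quantifying that grey area band by band would be straightforward; here's a minimal sketch (file name and band edges are arbitrary choices of mine, and obtaining the on-head data is of course the hard part):

```python
import numpy as np

# columns: frequency (Hz), fixture measurement (dB), estimated on-head response (dB)
freq, fixture_db, onhead_db = np.loadtxt("hp_fixture_vs_onhead.csv", delimiter=",", unpack=True)

bands = {"bass (<400 Hz)":        (20, 400),
         "mids (400 Hz - 1 kHz)": (400, 1000),
         "ear-gain (1 - 5 kHz)":  (1000, 5000),
         "treble (>5 kHz)":       (5000, 20000)}

for name, (lo, hi) in bands.items():
    sel = (freq >= lo) & (freq < hi)
    dev = onhead_db[sel] - fixture_db[sel]
    print(f"{name}: mean {dev.mean():+.1f} dB, rms {np.sqrt((dev**2).mean()):.1f} dB")
```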
So while Harman's research tends to support the idea that, once FR is a controlled variable and distortion is low enough, HPs are rated similarly, I'm not certain that FR remains a sufficiently well-controlled variable in terms of actual on-head response to make questions of "cup reflections and design" unimportant.
My understanding of these papers may be wrong, so feel free to correct me.
