What's most interesting to me about this paper is that some headphones are very consistent between the two HATS measurement fixtures while others—namely, the two used as the "playback" headphones in this study—seemingly vary a lot. Again I'm not super well-versed in this stuff, the last few months have been a rollercoaster of reading and learning for me, but if the more accurate ear is showing a big change vs. the older system, this strikes me as an indicator of bad design because the headphone should aim to be more consistent regardless of the head it's placed on, right?
This is indeed the most important takeaway here.
It's nothing new that the transfer function between fixtures differs from one headphone to another, and that deriving a transposition of the Harman target to the 5128 from the average transfer function of a poorly selected cohort of headphones can therefore be quite problematic. Sean Olive, among others, gave a few presentations on that subject a while back. The box plot graphs are just the logical conclusion of that problem.
Now, whether it's between different fixtures or different individuals, we should expect a desirable target to vary at the eardrum, since different anatomies affect the resulting SPL at the eardrum. That's why it's probably a good idea to distinguish between what belongs to the "desirable" inter-individual variation and what belongs to the "undesirable" one (I guess you've already read about HRTF maps, for example?).
That some headphones tend to produce more undesirable variation is already well known and well characterised: for example, Rtings has been measuring bass response on real humans for years, and there are plenty of AES papers on the subject (including from Harman). Bass is a part of the spectrum where an individual's anatomy should have a lot less influence over what a desirable target is than at higher frequencies.
It's harder to know whether the variation observed at higher frequencies is of the desirable type or not, as there aren't many publications in this area. But what can be done rather easily in that range is to test a pair of headphones on the same fixture for coupling effects (e.g. by compressing the pads). If the resulting frequency response changes a lot in specific bands, that's already a pretty good indication that these headphones are likely to exhibit a lot of undesirable inter-individual variation, since they can't even deliver a constant response on the same fixture. It would be better to determine what the ideal inter-individual variation should be for a cohort of individuals, then measure a number of headphones in situ and plot which ones seem to translate from one individual to another in a more desirable way, but this seems quite hard to do well and I am not aware of many publications on that subject.
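To make the pad compression test a bit more concrete, here is a minimal Python sketch of how two measurements of the same headphones on the same fixture could be compared band by band. Everything in it is an assumption for illustration: the function name `band_deviation`, the band edges, the 1.5 dB flagging threshold, and the data, which is made up and not taken from any real measurement.

```python
import numpy as np

def band_deviation(freqs_hz, resp_a_db, resp_b_db, bands_hz):
    """Mean absolute dB difference between two measurements, per band.

    Each band is (low_hz, high_hz), low edge inclusive, high edge exclusive.
    """
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    diff = np.abs(np.asarray(resp_a_db, dtype=float)
                  - np.asarray(resp_b_db, dtype=float))
    result = {}
    for low, high in bands_hz:
        mask = (freqs_hz >= low) & (freqs_hz < high)
        # NaN if no measurement point falls inside the band
        result[(low, high)] = float(diff[mask].mean()) if mask.any() else float("nan")
    return result

# Made-up data: normal seating vs. pads compressed, same fixture
freqs = np.array([20, 50, 100, 200, 500, 1000, 2000, 4000, 8000])
normal = np.zeros(9)
compressed = np.array([3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 1.0, 4.0, 2.0])

bands = [(20, 200), (200, 2000), (2000, 10000)]  # hypothetical band edges
deviation = band_deviation(freqs, normal, compressed, bands)

threshold_db = 1.5  # arbitrary illustrative threshold
flagged = [band for band, d in deviation.items() if d > threshold_db]
print(deviation)
print("Bands that change a lot:", flagged)
```

With this made-up data the bass and treble bands get flagged while the midrange doesn't. What counts as "a lot" is a judgment call, and the threshold here has no particular justification.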
Now, as we've discussed a while back in this forum, it's actually not that hard to find a good translation between Harman's rig and the 5128, provided one uses, as you pointed out, only headphones that we know are likely to translate well from one fixture to another in a desirable fashion (at least up to a few kHz). And if headphones that can't do so then show a "different difference" between their response and the target obtained, it should be considered a sign that the headphones' engineering leaves something to be desired. It should be seen neither as an indictment of a particular fixture nor as a sign that getting a half-decent target translation, at least up to around 3-4kHz, is impossible.
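As a rough illustration of that kind of target translation, here is a small Python sketch. It assumes all responses are in dB on a shared frequency grid, and it simply adds the mean fixture-to-fixture difference of the selected headphones to the target. The function name `translate_target`, the `keep` mask, and the numbers are all hypothetical, and this is not a description of how Harman or anyone else actually derived a target.

```python
import numpy as np

def translate_target(target_a_db, resp_a_db, resp_b_db, keep):
    """Shift a fixture-A target to fixture B.

    resp_a_db / resp_b_db: arrays of shape (n_headphones, n_freqs), the same
    headphones measured on fixture A and fixture B.
    keep: boolean mask selecting the headphones trusted to translate well.
    Returns the translated target and the max-min spread of the selected
    transfer functions at each frequency.
    """
    delta = np.asarray(resp_b_db, dtype=float) - np.asarray(resp_a_db, dtype=float)
    selected = delta[np.asarray(keep, dtype=bool)]
    mean_delta = selected.mean(axis=0)
    spread = selected.max(axis=0) - selected.min(axis=0)
    return np.asarray(target_a_db, dtype=float) + mean_delta, spread

# Made-up data: 3 headphones, 3 frequency points
resp_a = np.array([[0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0],
                   [2.0, 2.0, 2.0]])
resp_b = np.array([[1.0, 2.0, 3.0],
                   [2.0, 3.0, 4.0],
                   [2.0, 8.0, 0.0]])   # third headphone behaves very differently
keep = [True, True, False]             # exclude the inconsistent one

target_a = np.array([5.0, 5.0, 5.0])
target_b, spread = translate_target(target_a, resp_a, resp_b, keep)
print(target_b, spread)
```

A small spread among the selected headphones suggests the translation is reasonably well defined at that frequency. Given the point above, it would make sense to only trust the result up to around 3-4kHz.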
Basically, the problem here is not the 5128, it's the DCA XO :
Active systems used in some headphones are soon going to make this a moot point anyway. Feedback systems used in ANC headphones already compensate to some degree for inconsistencies in fit / leakage / impedance because they are, essentially, error loops (for example, the better designed ANC closed-back over-ears tend to outperform passive closed-backs by a wide margin in Rtings' consistency measurements). But they were traditionally limited to frequencies below 800Hz, and in some situations (like IEMs) were a double-edged sword. Companies like Bose and Apple, and possibly others, have already managed to extend the range over which they can use active systems to predict the in situ response at the eardrum up to several kHz, and I hope that Harman will soon join them (they've started to publish articles to that effect). For now I believe that it's limited to IEMs (I have not been able to measure the Bose CustomTune over-ears to understand what they're doing). While I can test to some extent whether or not these systems are successful with some DIY experiments, I can't make a comprehensive evaluation, so some doubt remains as to whether they've done a good job.
If the prediction is indeed correct, designing targets for different fixtures becomes increasingly easy as the HPTF effects are minimised, but anatomically derived "desirable" differences (among other factors) may still call for different targets.