My question here is: for defining something like a Harman curve for the 5128, do you think it would make sense for the GRAS systems to have a "high impedance" and "low impedance" version of the target? Like if we take a "low variation" headphone like the HE1000 in the study and "high variation" headphones like the DCAs or other closed-back headphones, EQ them to the same target on the 5128, and then measure the headphones on the GRAS system, one would assume that the GRAS system would show similar behavior under 1 kHz for the "low variation" headphone but more energy in the "Definitely problematic" region. I think something like this could be one way to attack the problem of the GRAS systems' behavior in the "Definitely problematic" band, but I guess we would need to measure the output impedance of the headphone to confirm which target to use.
I hope I'm not misunderstanding you, but I am not certain a binary approach like this is ideal, given that we're dealing with issues best expressed in degrees.
Ideally, at least at low frequencies, all publications would do what Rtings is doing (and severely punish headphones that can't deliver a constant bass response), but it's probably not feasible in practice for most. Or perhaps even better, we'd measure headphones on a cohort of dummy heads (for example, 12) that have been evaluated as a good, balanced representation of a larger population of real humans, and evaluate them against the ideal target for each of these dummy heads. But that's even less feasible and might for now be limited to the R&D labs of large companies.
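To make the cohort idea a bit more concrete, here is a rough sketch of how the scoring could work, assuming each dummy head comes with its own target and the headphone is measured once per head. The function names, the RMS error metric and the numbers are all made up for illustration; nothing here is an established scoring method.

```python
from math import sqrt

def rms_error_db(measured_db, target_db):
    """RMS deviation (dB) between one measurement and that head's own target."""
    diffs = [m - t for m, t in zip(measured_db, target_db)]
    return sqrt(sum(d * d for d in diffs) / len(diffs))

def cohort_summary(measurements, targets):
    """Score a headphone against each head's own target, then summarise.

    measurements and targets are dicts keyed by head name, each holding a
    list of dB values on the same frequency grid (hypothetical layout).
    """
    errors = {
        head: rms_error_db(measurements[head], targets[head])
        for head in measurements
    }
    return {
        "per_head": errors,
        "mean": sum(errors.values()) / len(errors),
        "worst": max(errors.values()),
    }

# Made-up numbers for two heads, three frequency points each.
measurements = {"head_a": [1.0, 2.0, 3.0], "head_b": [2.0, 2.0, 2.0]}
targets = {"head_a": [1.0, 2.0, 3.0], "head_b": [0.0, 0.0, 0.0]}
summary = cohort_summary(measurements, targets)
print(summary["mean"], summary["worst"])
```

Reporting the worst head alongside the mean is a deliberate choice here: a headphone that is fine on average but far off on a few heads is exactly the case a single-fixture measurement hides.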
Otherwise I'd prefer to see GRAS (or 5128 for that matter) measurements supplemented with additional measurements aiming at characterising the headphones' behaviour under various circumstances such as leakage (either a controlled "quantity" of leakage, or a consistent "physical" mod like glasses, or both - for example a pair of headphones might be quite susceptible to leakage effects, and yet have a very good physical / ear pad design that reduces the quantity of leakage present over most heads), pad compression, positional variation, etc. This can help to determine which parts of the spectrum stay consistent under these circumstances and which don't, and can inform the interpretation of the measurements so that it isn't misleading and the right conclusions are reached: some headphones might not hit the target well, but might turn out to be very stable and desirable platforms for EQ, while others might hit the target well on a given ear simulator, and yet turn out to be poorly engineered because they can't deliver a consistently desirable response (DCA Stealth for example).
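As a rough illustration of what "consistent parts of the spectrum" could mean in practice, here is a minimal sketch that takes the same headphone measured under a few conditions and flags the frequencies where the spread stays small. The condition names, the 1 dB threshold and the data are hypothetical, chosen only to show the idea.

```python
from statistics import pstdev

def spread_by_frequency(freqs_hz, responses_db):
    """Standard deviation (dB) across conditions at each frequency point."""
    spreads = []
    for i in range(len(freqs_hz)):
        values = [response[i] for response in responses_db.values()]
        spreads.append(pstdev(values))
    return spreads

def stable_frequencies(freqs_hz, responses_db, max_spread_db=1.0):
    """Frequencies whose spread across conditions is within max_spread_db.

    The 1 dB default is an arbitrary illustrative threshold, not a standard.
    """
    spreads = spread_by_frequency(freqs_hz, responses_db)
    return [f for f, s in zip(freqs_hz, spreads) if s <= max_spread_db]

# Made-up responses (dB) for one headphone under three conditions.
freqs_hz = [50, 200, 1000, 4000]
responses_db = {
    "good seal": [5.0, 2.0, 0.0, 3.0],
    "glasses": [1.0, 1.5, 0.0, 3.2],
    "compressed pads": [4.0, 2.2, 0.1, 2.5],
}
print(stable_frequencies(freqs_hz, responses_db))
```

With these invented numbers the bass point at 50 Hz is the only one flagged as unstable, which is the kind of result you'd expect from a leakage-sensitive design.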
Where I can see some value in using two targets for 711 test systems is for IEMs, for active vs passive IEMs, but even then it's fraught with difficulties (in particular since active systems operate over different ranges from one IEM to another and the transition region where the passive behaviour takes over can behave rather oddly in some cases). I've seen a lot of publications test IEMs like the AirPods Pro 2 in a 711 coupler against the Harman IE target (designed using a passive IEM) and this is just plain wrong and misleading.
But I rather think that the 711 standard just isn't well suited to testing IEMs, given the increasing amount of concordant evidence we have that it doesn't represent the behaviour of an average ear well (the original impetus of B&K's research that led to the 5128). Harman recently published an AES article on a methodology to estimate the response at the eardrum up to several kHz when using IEMs with an inward-facing mic, and the results comparing real ears with the 711 and 5128 are another (expected) piece of evidence to that effect:
https://aes2.org/publications/elibrary-page/?id=22943
(To be clear, figures 9 and 10 are estimations, not direct in situ measurements, but they are concordant with what we already know, and if I understand the article well enough, the estimation method's error was tested against actual measurements in the two ear simulators.) Another point of interest to me here is that we can see one failing of the 5128 fixture when measuring IEMs: it often presents "wiggles" in the lower mids that seem far less prevalent in real ears, and I wish B&K fixed that - for now the only thing we can do is ignore them and "connect the dots" across these wiggles.
The whole thing kind of leaves me scratching my head about why they used the DCA headphones. Even in this study, they observed other headphones to have less variation between fixtures.
In fairness, this study shows that one of DCA's headphones could be very consistent, but indeed the X / XO / Noire / Stealth have already been measured as being quite undesirably inconsistent across individuals (and fixtures) and should absolutely not be used in any capacity to translate Harman's work to the 5128, if that's the method one chooses to design a target for the 5128.