Independently quantifying how well the rig predicts listener preference. Is it comparable to or better than the established GRAS 45CA and equivalent rigs? I would prefer that all discussion of the BK5128 and other IEC 60318-7 rigs center on this going forward. Prematurely dismissing the GRAS 45CA and equivalent rigs isn't productive and goes against reasonable standards of evidence.
I understand that the idea of measuring a pair of headphones on fixture X and deriving some predictive score against a target from a set of factors, or even just eyeballing it, is very seductive. But if you engage critically with the Harman papers, you'll realise that this was never really tested except in one of the articles, despite what has been said about that research.
What Harman really tested is "if you apply a number of EQ profiles to a pair of HD800 / HD518 / modded K712 / Momentum IEM, which one is preferred?".
Now these over-ears were specifically chosen by Harman because, in their own tests, they proved to be a more stable platform than some other headphones, e.g. they had a more consistent bass response across individuals. The HD800, in particular, was a very good choice (the HD518 and K712 maybe less so, but the latter was modded to increase clamp force, which perhaps helped). The Momentum in-ear was paired with a MEMS mic to check the seating in each individual's ears, a major improvement over a lot of previous studies that did not bother to check for seal when using IEMs as test devices.
If the headphones you measure on a GRAS fixture (ideally with the Welti pinna, but Amir's rig is mostly rather similar when looking at the big picture) behave as consistently across individuals as the headphones Harman used, and also translate nicely from the rig to real humans, you can probably be a lot more confident in predicting whether or not a pair of headphones will be preferred over another one. But if the headphones under test are poor performers in that regard and their response varies significantly when the load varies, your capacity to predict preference can go out the window rather quickly. With the exception of one paper, these HPTF issues played no part in Harman's preference results, as the "real" headphones weren't even used during the listening tests.
The irony is that for all the talk about how inconsistent headphones are when measured on the 5128 fixture... well that just isn't the case for these rather stable headphones, in fact Amir had very few issues getting consistent traces for these when he tested the 5128. They also tend to produce a response that's not too far off the GRAS rigs anyway up to a few kHz (nothing unexpected here), and they also tend to have a more consistent transfer function between the 5128 and GRAS rigs, which makes a rough translation of the Harman target with that method rather easy anyway up to a few kHz.
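To make the "rough translation" idea concrete, here is a minimal sketch, assuming you have the same few stable headphones measured on both rigs on a shared frequency grid. It averages the per-frequency dB difference between the rigs and adds it to the target. The function names and all the dB values are hypothetical and made up for illustration; this is not Harman's or anyone's published method, and it only makes sense in the range where the rig-to-rig difference is consistent across headphones (up to a few kHz).

```python
from statistics import mean

def rig_delta(pairs):
    """Average per-frequency difference (rig B minus rig A), in dB.

    pairs: list of (response_on_a, response_on_b) tuples, one per headphone,
    all sampled on the same frequency grid.
    """
    n_points = len(pairs[0][0])
    return [mean(b[i] - a[i] for a, b in pairs) for i in range(n_points)]

def translate_target(target_a, delta):
    """Shift a target defined on rig A by the average rig difference."""
    return [t + d for t, d in zip(target_a, delta)]

# Hypothetical dB values at 100 Hz, 1 kHz and 3 kHz (invented, not real data).
stable_headphones = [
    ([0.0, 0.0, 10.0], [1.0, 0.0, 12.0]),  # headphone 1: (rig A, rig B)
    ([2.0, 1.0, 8.0], [2.0, 1.0, 11.0]),   # headphone 2: (rig A, rig B)
]
delta = rig_delta(stable_headphones)
target_on_b = translate_target([5.0, 0.0, 10.0], delta)
print(delta)
print(target_on_b)
```

If the per-headphone differences scatter a lot around that average, that is already a hint the headphones used are not stable enough for this kind of translation.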
The 5128 seems more eager to trigger leakage scenarios than the "flat plate around the pinna" rigs like the 45CA, so it's no surprise that leak-intolerant headphones (most closed-back headphones are in that basket) proved harder to measure consistently. Personally I see this as a benefit, as good headphones should be designed to handle worst-case scenarios as well as possible anyway (these leakage effects will be present on real humans, more for some, less for others). But I wish that more publications had continued to use their GRAS rigs alongside the 5128 for over-ears and systematically presented measurements done on both fixtures. It's one proxy among others for making an educated guess as to which headphones are more or less likely to avoid introducing undesirable in situ variation across individuals.
For IEMs, the fact that the 711 coupler is, at best, very much an outlier and quite far from the average human ear starts becoming a pretty big problem when applying the knowledge we've gained from Harman's work to IEMs with a significantly different source impedance. ANC headphones are, in a way, the extreme example of that problem: the IE target "bakes in" this offset from the average, while feedback systems will more or less nullify the offset. The issue here isn't even a question of preference; it's that the error curve against that target will simply be invalid and misrepresent the actual in situ experience for most individuals (so of course it won't help in making good predictions).
Again, I understand that it's seductive to think that tracing a single line against a target can mean something in terms of predicting users' preferences, but it's only going to have some predictive value if the in situ response is predictable to begin with. And if it is predictable to begin with, then the 5128 fixture won't cause any major issue in getting more or less consistent, repeatable traces anyway, and finding a half-decent translation of Harman's target for over-ears isn't a huge challenge either. So the problem of predicting preferences is a headphones problem rather than a fixture problem. Which is why it's very useful to get, by any means possible, even a rough idea of how a pair of headphones behaves when exposed to different loads, leakage scenarios, positions, etc., so that you know how confident you can be in your predictions.
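As a rough illustration of "how confident you can be", here is a minimal sketch that computes an error curve against a target for several seatings or loads of the same headphone, then looks at how much a simple error metric moves between them. The standard deviation of the error curve is used here only as a stand-in metric; it is an assumption for the example, not Harman's published model, and all names and dB values are hypothetical.

```python
from statistics import pstdev

def error_curve(response, target):
    """Per-frequency deviation from the target, in dB."""
    return [r - t for r, t in zip(response, target)]

def error_sd(response, target):
    """Stand-in error metric: standard deviation of the error curve."""
    return pstdev(error_curve(response, target))

def sd_range(responses, target):
    """Smallest and largest error metric across seatings/loads."""
    sds = [error_sd(r, target) for r in responses]
    return min(sds), max(sds)

# Hypothetical dB values on a four-point grid (invented, not real data).
target = [0.0, 0.0, 0.0, 0.0]
stable_seatings = [[1.0, -1.0, 1.0, -1.0], [1.0, -1.0, 1.0, -1.0]]
leaky_seatings = [[1.0, -1.0, 1.0, -1.0], [-5.0, -1.0, 1.0, -1.0]]  # bass lost with a leak

stable_range = sd_range(stable_seatings, target)
leaky_range = sd_range(leaky_seatings, target)
print(stable_range, leaky_range)
```

The wider the range across seatings, the less a single trace against the target tells you, whichever fixture produced it.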
A fairly extreme example of predictability being a headphones problem first and a fixture problem second is advanced active IEMs like the APP2 / APP3: they will deliver nearly the same SPL and FR curve up to 4-5 kHz or so regardless of whether they're measured in a 711 coupler or a 5128 (provided they're primed properly and have the same source and device volume). For these headphones, the fixture doesn't even matter for deriving predictions. (Well, that isn't exactly true: there's the issue that the exact same SPL in the 1-5 kHz range across individuals likely isn't desirable, something the APP3 (and Bose CustomTune IEMs) seem to try to tackle, but that's worthy of its own thread. These two companies have long gone past the idea of evaluating headphones in a single ear simulator against a single target and are so far ahead of the level of discussion we're having that it's a bit disheartening.)
Designing a headphone which is more consistent on more heads should be the goal, rather than a headphone that matches a specific target on one measurement rig.
The DCA Stealth is such a good poster child for this. Measures exceptionally well on a GRAS fixture, has very high inter-individual variation across listeners and an average in situ response that's quite far off what's measured on said rigs. Makes predicting people's preferences for it by plotting its error curve against Harman a crapshoot.