As you wish.
You're correct that the results are highly variable in situ, but the same can of course be said of an unwindowed in-room measurement of a speaker at an arbitrary position. The difference with speakers is that we've got a fairly nifty toolkit for divorcing ourselves from the reality of the playback situation for practical measurements, whereas with headphones, you must take the whole kit and caboodle as well. This reflects the reality of headphones on heads, however - the audible effects of positional variation in headphones are meaningful, just as moving around within a listening room is meaningful to our subjective perceptions.
This is where you're going off track. The inverse HRTF filtering process we apply to speakers we apply to headphones as well (although we don't use the inverse filter set for a frontal sound source; see Theile 1986, 2016, etc.), and the same rules for timbral accuracy apply: we need the sound power at the eardrum to approximate what the listener's own head would produce in the sound field of the perceived acoustic source (arguably a diffuse field, see Theile if you care but honestly nobody does these days) for the sound to be perceived as "correct", or "flat", or what have you.
In a scenario where HRTF varied wildly - which, to some degree, it does, at least at some frequencies - and the response of headphones at the eardrum was absolutely constant across users, you would be quite correct that we would be unable to predict subjective tonality above perhaps 2 kHz or so. In such a world, each headphone would effectively be different to each user precisely because it was the same and the users differed. I've spent a fair amount of time measuring headphones on different pinnae, however, and I simply do not see this effect - there are meaningful differences between the same headphone measured on two different sets of ears, and two different headphones on two sets of ears tend to differ in similar ways, at least for the circumaural designs that make up the majority of the high-end market. I'll freely acknowledge this as a caveat of any design that bypasses the ear anatomy - and IMO it contributes to the higher variation of preference in in-ear designs, which bypass the ear almost entirely - but it's my opinion that you just can't reconcile the degree of skepticism you're voicing with the degree of consistency in, e.g., Sean Olive's correlation of headphone response on a mannequin to subjective assessments.
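To put toy numbers on the thought experiment above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the dB values are invented, not measurements, the names (`hrtf_dev`, `headphone_on_mannequin`, `coupling`) are hypothetical, and the simple additive model of how an ear colors a headphone is a stand-in, not an established result. It only shows the logic: if the eardrum response were identical for everyone, the perceived error would differ from listener to listener; if the ear shapes the headphone's response the way it shapes external sound, listeners largely agree.

```python
# Toy numbers only: every dB value here is invented for illustration.
# Each list holds deviations in dB at three hypothetical high-frequency bands.

# Hypothetical per-listener HRTF deviation from the mannequin's HRTF.
hrtf_dev = {
    "listener_a": [2.0, -3.0, 4.0],
    "listener_b": [-1.0, 2.0, -2.0],
}

# Hypothetical headphone deviation from target, as measured on the mannequin.
headphone_on_mannequin = [1.0, -1.0, 0.5]


def perceived_error(eardrum_dev, target_dev):
    """Tonal error heard: eardrum response minus the listener's own target."""
    return [e - t for e, t in zip(eardrum_dev, target_dev)]


def constant_eardrum(listener):
    """Scenario 1: the eardrum response is identical for every listener."""
    return perceived_error(headphone_on_mannequin, hrtf_dev[listener])


def tracking_eardrum(listener, coupling=1.0):
    """Scenario 2 (assumed model): the ear colors the headphone's response
    by `coupling` times the amount it colors an external sound source."""
    eardrum = [
        h + coupling * d
        for h, d in zip(headphone_on_mannequin, hrtf_dev[listener])
    ]
    return perceived_error(eardrum, hrtf_dev[listener])


def spread(errors_a, errors_b):
    """Largest per-band disagreement between two listeners, in dB."""
    return max(abs(a - b) for a, b in zip(errors_a, errors_b))


if __name__ == "__main__":
    s1 = spread(constant_eardrum("listener_a"), constant_eardrum("listener_b"))
    s2 = spread(tracking_eardrum("listener_a"), tracking_eardrum("listener_b"))
    print(f"constant eardrum response: listeners disagree by up to {s1} dB")
    print(f"ear-tracking response:     listeners disagree by up to {s2} dB")
```

Setting `coupling` somewhere between 0 and 1 models a design that only partly engages the ear anatomy, which in this toy model leaves some listener-to-listener disagreement in place.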