I don't know what you mean by "observation". If by that you mean actually listening to headphones, well, for me the value of dummy head measurements is precisely that I don't have to bother with headphones whose FR exhibits deficiencies of such a nature and degree that I can very effectively correlate them with my not liking how they sound.
The 1266 is one of those - particularly since several copies have been measured by different people with similar results (Crinacle).
Another good example would be the B&W PX7 (as it turns out, I did listen to these, and my ears still haven't forgiven me).
The deficiencies of both of these are of such a nature and degree that Harman's research would probably suggest that they'd lose to more preferred FR curves in blind tests for most people.
Once measurements pass my "good enough" threshold, that is when I want to listen to the headphones before expressing a preference for one or another. Typical headphones of that kind would be the K371, HD650, HD560S, AirPods Max, Sundara, Empyrean, etc. It doesn't mean that I'll like all of them, far from it.
The gist of the idea is that all squares are rectangles, but not all rectangles are squares. I.e., all the headphones I've enjoyed and kept in the long run measure reasonably well on a dummy head, but I don't necessarily enjoy all headphones that measure reasonably well in these conditions.
Most headphones would actually land in the "meh, not a priority" category based on their FR measurements, and if I'm being kind, the Diana V2 would barely make it in.
I would also suggest not relying on someone else's subjective observations for the most part, as a headphone's FR can vary quite a bit from person to person if it were measured on their own head, particularly at the two extremes of the spectrum. Here is an example of how the HD820 varies on five different humans below a few hundred Hz:
https://www.rtings.com/headphones/1-3-1/graph#669/3185
And the methodology:
https://www.rtings.com/headphones/tests/sound-quality/frequency-response-consistency
The reason I can easily pass judgment on the 1266 is that it shows gross deficiencies in the area where these variations will be the smallest.