Good observation, but what do you mean exactly by "saturation and sharpness"?
The issue with strong peaks, whatever their Q, is that they may be so prominent that they overshoot the mean loudness of their band. As a consequence you may have to turn down the volume, which makes the peak more bearable but also attenuates the neighbouring frequencies and finally leads to a loss of detail. In short, it's a matter of missing balance in the high-frequency response.
It would need to be proven that the difference between couplers (or even heads, in a large-scale test) is measurably large enough to be regarded as important in measuring treble response.
The deviations induced by pinnae and ear canals of different shapes and sizes have been sufficiently investigated over the past fifty years.
Hammershøi and Møller, Sound transmission to and within the human ear canal (1996):


1st graph: free field response at the ear entrance point = transfer function of the pinna
2nd graph: free field response at the drum reference point = transfer function of pinna + ear canal
3rd graph: response at the drum with the source at the ear entrance = transfer function of the ear canal
We can say that such resonances for sure DO matter when we are talking about accurate sound reproduction.
However, that does not mean that an average approximation like the curve gathered by Harman does not serve a purpose. In fact, headphones voiced towards an average in-room preference are still way better than an arbitrary curve that has been colored by an individual taste and ideology. "Accuracy" is still another thing, though.
If we're really speaking about a completely inaccurate response in the >4-5kHz range then I'd say that it's absolutely pointless to bother discussing it, since then we'd be completely in the dark when using actual measurements. Yet, I believe on GRAS we're measuring the summed response (so we can omit predicting the actual resonances, since we're measuring them anyway) and it's debatable whether this should even get mentioned.
...
It seems that Harman found out that until 8-10kHz the difference is consistent enough to take it into account and after that there's no point EQ-ing.
The older 60318-4 ear simulators are rated with a smooth, consistent response up to 10 kHz. Everything above is quite uncertain due to the high-Q tube resonance, which is specified at 13.5 kHz (+/- 1.5 kHz). The resonance was measured in a closed system with the mic right at the entrance of the coupler. Now, when we attach an open ear canal to that (which is the standard for the newer measurement fixtures), the simulator turns into an open tube resonator with much wider bandwidth and one or two distinct spikes at certain frequencies. The position of those is determined by the shape and length of the tube.
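To get a feeling for how the tube length moves those spikes around, here is a rough sketch using textbook ideal-tube formulas. It is not a model of any real coupler: the two lengths are hypothetical values, the first one picked only so that the idealised closed-tube mode lands near the 13.5 kHz mentioned above, and damping, diameter changes and the eardrum impedance are all ignored.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def closed_closed_modes(length_m, count=1):
    """Half-wave modes of an ideal tube closed at both ends: f_n = n * c / (2 L)."""
    return [n * SPEED_OF_SOUND / (2.0 * length_m) for n in range(1, count + 1)]

def closed_open_modes(length_m, count=2):
    """Odd quarter-wave modes of an ideal tube closed at one end: f_n = (2n - 1) * c / (4 L)."""
    return [(2 * n - 1) * SPEED_OF_SOUND / (4.0 * length_m) for n in range(1, count + 1)]

# Hypothetical lengths, for illustration only (not taken from any data sheet).
COUPLER_LENGTH_M = 0.0127   # assumed effective length of the closed coupler
CANAL_EXTENSION_M = 0.012   # assumed length of the attached open ear canal

closed_modes = closed_closed_modes(COUPLER_LENGTH_M)
open_modes = closed_open_modes(COUPLER_LENGTH_M + CANAL_EXTENSION_M)

print("closed coupler:", [round(f) for f in closed_modes], "Hz")
print("with open canal:", [round(f) for f in open_modes], "Hz")
```

Changing CANAL_EXTENSION_M by a few millimetres shifts both open-tube modes noticeably, which is the point: the spikes follow the geometry of the tube, not the headphone.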
The resulting problem is that the measured response can get really messy when the self-resonances of the ear simulator overlap the resonances of the headphone driver and cavity as well as the interferences induced by the silicone pinna. Without further insight, it is very hard to isolate and judge the actual performance of the headphone because you only receive the total sum of all the components interfering with each other.
Some people may say that this is totally fine since those fixtures are derived from average models and thus do actually reflect the average listening experience in total. And indeed, as long as you don't look too closely and apply a certain amount of smoothing (the Harman response is heavily smoothed!) such measurements can actually be quite usable. However, keep in mind that YOUR ears are not the ones that were measured in the chain! Your ear canal length, diameter and bending might and actually will differ from those of a reference defined by GRAS or B&K, and so will the size and shape of the pinna.
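To illustrate what smoothing does to a narrow peak, here is a minimal sketch of fractional-octave smoothing on synthetic data. The curve is made up (flat with a narrow +12 dB peak at 8 kHz), and averaging the dB values directly is a simplification; this is not the procedure Harman or anyone else actually uses.

```python
import math

def smooth_fractional_octave(freqs_hz, levels_db, fraction=3):
    """Replace each point by the mean of all points within +/- 1/(2*fraction) octave."""
    half_width = 2 ** (1.0 / (2 * fraction))
    smoothed = []
    for f0 in freqs_hz:
        window = [lvl for f, lvl in zip(freqs_hz, levels_db)
                  if f0 / half_width <= f <= f0 * half_width]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Synthetic response: 1 kHz to 16 kHz in 1/24-octave steps,
# flat at 0 dB except for a narrow +12 dB peak centred on 8 kHz.
freqs = [1000 * 2 ** (i / 24) for i in range(97)]
levels = [12.0 * math.exp(-(math.log2(f / 8000.0) / 0.05) ** 2) for f in freqs]

smoothed = smooth_fractional_octave(freqs, levels, fraction=3)

print("raw peak:      %.1f dB" % max(levels))
print("smoothed peak: %.1f dB" % max(smoothed))
```

With 1/3-octave smoothing the peak shrinks to a fraction of its raw height, so a plot can look well behaved while a narrow resonance is still there.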
I can only point to the research done by Hammershøi & Møller, David Griesinger and Paul Barton, which shows how complex and diverse both human anatomy and auditory perception actually are.
To come back to the latter point, which is the Harman research, I don't know if they already had the newer high-res pinna and coupler (ear simulator, to be exact) when they did their latest investigations. Essentially, the pinna got an anthropometric ear canal with the first and second bend (averaged from a lot of 3D scans). Moreover, they damped the self-resonance of the 60318-4 coupler, which now offers much smoother responses with fewer interference issues.
Still, whatever those components spit out as a final plot, there remains the question of the individual listener's anatomy in "non-average" practice. The improvements described above might decrease tolerances and improve the repeatability of the measurements. And that's a good thing without any doubt. But they tell us only a little about the exact sound reproduction at the listener's ear drum, which would be essential if we wanted to analyze the level of detail a headphone can produce.
After all, it is a very individual thing. That's why generalized measurement systems hit their limits sooner or later.
you could also do something like what Oluv's Gadgets does - get really good recordings of the output of different headphones and then A/B and try and pick the "higher detail-retrieval" phones...
As much as I appreciate Oluv's efforts in HiFi, his procedure to track and simulate the coloration of a headphone with a miniDSP EARS is totally flawed since ...
a) we have no data confirming that the impedance of the whole system is anywhere near that of a real human ear, let alone an average one
b) the rig is technically not even an ear simulator, just an electret microphone (with more or less unknown calibration) stuck into a rather hard silicone pinna
c) you are recording a transducer with a transducer, then playing back the whole thing with yet another transducer and expecting the outcome to be anywhere near the original. Every component in the chain adds its own artifacts and distortion.
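Point c) can be put into numbers with a toy example. For linear stages connected in series, the magnitude responses in dB simply add up. All values below are invented for illustration and are not measurements of any product; nonlinear distortion is not covered by this at all.

```python
def cascade_db(*stages_db):
    """Sum the per-frequency magnitude responses (in dB) of stages in series."""
    return [sum(values) for values in zip(*stages_db)]

freqs_hz = [1000, 4000, 8000, 12000]

# Hypothetical deviations from flat, in dB, at the frequencies above.
recorded_headphone_db = [0.0, 2.0, -3.0, 4.0]   # the headphone being "captured"
measurement_rig_db = [0.0, 1.0, 5.0, -6.0]      # microphone plus pinna coloration
playback_headphone_db = [0.0, -1.5, 2.5, 3.0]   # the listener's own headphone

heard_db = cascade_db(recorded_headphone_db, measurement_rig_db, playback_headphone_db)

# Difference between what is heard and what the recorded headphone really does.
error_db = [h - r for h, r in zip(heard_db, recorded_headphone_db)]

for f, h, e in zip(freqs_hz, heard_db, error_db):
    print("%5d Hz: heard %+.1f dB, error %+.1f dB" % (f, h, e))
```

Unless the rig and the playback headphone happen to cancel each other exactly, the error is never zero, and in the treble it can easily be larger than the differences you are trying to compare.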
Regards
Dreyfus
PS: Sorry for the wall of text. But I think you really need the context to understand why such systems have certain practical limits and should not be taken as incontestable standards.