The first and foremost over-simplification is to assume that the FR that's been measured on an ear simulator is the one your own sample will deliver on your head.
By doing so you've just discarded an "essential complexity", and one that has already been amply demonstrated.
For the over-ears:
View attachment 213171
OE1 is the Meze Elite and OE3 (ANC on and off) is the AirPods Max. For these it's fairly easy to tell (the kinks in the FR for the Elite, and the shape of the FR in the feedback range for the APM).
OE2: I'm going to guess that it's an Audeze planar, possibly the LCD-5 given the shape of the 1–4 kHz region.
OE4: I'm less certain, but I'm tempted to think that it might be the K361.
OE5 is also more difficult to decipher; perhaps the DCA Stealth?
MDAQS suggests that the AirPods Max are the best over-ear headphones of the lot, and that the Meze Elite and the Audeze planar (LCD-5?) suck ass in comparison:
View attachment 213172
Here is another video that provides a little more detail:
Unfortunately that's still nowhere near enough to actually know anything meaningful about the methodology they used. I have a lot of unanswered questions about it.
Given how nebulous it currently is, it's a completely unreproducible piece of research.
For a start I'd like to know more about the systems used during the subjective listening tests (a photo in the presentation above suggests that at least for one session HD6... were used, but were they compensated? How? To which target?), how exactly the binaural samples were recorded, how exactly each sub-category such as "spectral flux" is measured, and the individual results for each of them.
And the videos already contain some rather obvious logical fallacies, such as at 24:27 in ADU's first video, where the narrator says that "not everything is determined by frequency response" (which may or may not be the case) on the basis that two headphones with markedly different FRs still received the same MDAQS scores (in all three sub-categories, including "timbre"). This is fallacious: the same score only means that, according to their predictive model, they'd be equally preferred. But that could just as well mean that they are equally diverging from some ideal, and nothing in that slide excludes the possibility that this ideal could be reached by altering the FR alone.
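To make that point concrete, here's a deliberately naive sketch (my own toy model, not anything MDAQS actually does): if a preference predictor ultimately boils down to some distance between the measured FR and an ideal target, then two markedly different FRs sitting at the same distance will tie, and the tie tells us nothing about whether FR alone determines the score.
```python
import numpy as np

# Toy illustration, NOT MDAQS's model: score a headphone by how far its FR
# deviates (RMS, in dB) from a hypothetical ideal target. Two very different
# deviation curves can still land on exactly the same score.
freqs = np.logspace(np.log10(20), np.log10(20e3), 200)   # 20 Hz - 20 kHz grid

dev_a = 3.0 * np.sin(2 * np.pi * np.log10(freqs))        # wavy +/-3 dB error
dev_b = np.roll(dev_a, 70)                                # same values, shifted:
                                                          # a very different curve
def naive_score(deviation_db):
    """Map RMS deviation from the target to an arbitrary 1-10 'preference' score."""
    rms = np.sqrt(np.mean(deviation_db ** 2))
    return max(1.0, 10.0 - rms)

print(naive_score(dev_a), naive_score(dev_b))             # identical scores
```
Identical scores, different FRs, and yet in this toy model the score is determined by FR and nothing else.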
Also, MDAQS is designed to be applied to various reproduction systems, from headphones to car audio. Even if each block contains several different attributes (and all blocks, BTW, include FR), that doesn't necessarily mean that all of those attributes are meaningfully relevant for headphones. For example, within the "timbre" block, it could very well be that "spectral flux" (what does that mean?) is correlated with the FR attribute, and that equalising headphones to the same target also yields the same spectral-flux values (while that may not be the case for loudspeakers). We'll need a lot more published data to know.
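For what it's worth, the usual meaning of "spectral flux" in the audio/MIR literature is the frame-to-frame change of the magnitude spectrum; whether MDAQS computes it that way, and on which signal (the binaural recording? a deviation from a reference?), is precisely the kind of detail that hasn't been published. A rough sketch of the textbook definition, assuming nothing about their implementation:
```python
import numpy as np
from scipy.signal import stft

def spectral_flux(x, fs, nperseg=1024):
    """Textbook spectral flux: frame-to-frame change of the magnitude
    spectrum, half-wave rectified so only increases in energy count.
    This is the common MIR definition, not necessarily MDAQS's."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)      # STFT: (freq bins, frames)
    mag = np.abs(Z)
    diff = np.diff(mag, axis=1)                    # change between adjacent frames
    return np.sum(np.maximum(diff, 0.0), axis=0)   # one flux value per frame
```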
Without additional public data, MDAQS is, as it stands, meaningless to me.