I'm aware of it, but the simple truth is that it contradicts Olive's scoring formula.
Does it, though?
@Floyd Toole is often quoted as saying "if the direct sound is wrong, nothing else matters". You look at the Olive formula, see that it seems to prioritize PIR over ON, and you conclude that Olive contradicts Toole. I don't think that reasoning holds, for several reasons.
First, it's difficult for a statistical model based on PCA to encode something like "if variable X is bad, variable Y doesn't matter". Indeed, that would imply a non-linear interaction between variables, something a linear technique like PCA can't capture. So the modelling process couldn't have come up with such a rule even if it tried. Add that to the (already quite large) bucket of "possible reasons why the Olive model doesn't achieve a correlation closer to 1".
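To make that point concrete, here's a toy sketch (entirely synthetic data, nothing to do with Olive's actual listening tests): a "veto" rule like "if direct-sound quality is bad, nothing else matters" behaves like min(x, y), and the best possible linear fit of the form a·x + b·y + c leaves large errors no matter what weights are chosen. The variable names are hypothetical stand-ins.

```python
import numpy as np

# Toy illustration: a "veto" rule such as "if direct-sound quality x is bad,
# off-axis quality y doesn't matter" acts like score = min(x, y), which is
# non-linear and outside what a linear (PCA/regression) pipeline can express.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)   # hypothetical direct-sound quality
y = rng.uniform(0, 1, 500)   # hypothetical off-axis quality
score = np.minimum(x, y)     # the veto rule a linear model must approximate

# Best linear fit a*x + b*y + c, found by least squares
A = np.column_stack([x, y, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, score, rcond=None)
resid = score - A @ coef

# The residual stays large: no choice of weights encodes the veto.
print("max |error| of best linear fit:", np.abs(resid).max())
```

On the unit square the best linear approximation of min(x, y) still misses by roughly a quarter of the score range, which is the sense in which no reweighting can rescue a linear model here.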
Second, the assessment that PIR is given a higher weight in the model than ON rests on the fact that the combined weights of NBD_PIR and SM_PIR are higher than the weight of NBD_ON. The problem is that the weird behaviour of SM throws a wrench into that reasoning. As has been discussed multiple times in the past, SM tends to be anti-correlated with NBD: if the slope increases, SM improves but NBD worsens, and vice versa. This leads to the conclusion that a speaker cannot simultaneously achieve a perfect SM_PIR and a perfect NBD_PIR (unless you're gaming the model and doing something absurd like a perfectly flat PIR with an infinitesimal slope - which, by the way, is another reason why using the model as a target could give "surprising" results). Because NBD_PIR+SM_PIR can't be maxed out, a statement like "NBD_PIR+SM_PIR has a higher weight in the model than NBD_ON" is questionable at best.
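The SM/NBD tension can be shown numerically with simplified stand-ins for the two metrics (Olive's actual definitions use specific half-octave bands over roughly 100 Hz–12 kHz, so the band splitting and absolute numbers below are illustrative only): a smoothly tilted PIR scores a perfect SM but a nonzero NBD, while the "infinitesimal slope" trick games both at once.

```python
import numpy as np

# Simplified stand-ins for Olive's metrics (illustrative only; the real
# definitions use specific half-octave bands and frequency limits).
def sm(logf, spl):
    """Smoothness: r^2 of a straight-line fit of SPL vs log-frequency."""
    r = np.corrcoef(logf, spl)[0, 1]
    return r * r

def nbd(spl, bands=14):
    """Narrow-band deviation: mean |deviation from band mean| per band."""
    chunks = np.array_split(spl, bands)
    return np.mean([np.mean(np.abs(c - c.mean())) for c in chunks])

logf = np.linspace(2, 4.08, 280)        # ~100 Hz to ~12 kHz, log10 scale

tilted = -1.0 * (logf - logf[0])        # smooth downward tilt
almost_flat = -1e-9 * (logf - logf[0])  # "gamed" PIR: flat with infinitesimal slope
# (An exactly flat line has NBD = 0 but SM undefined: zero variance.)

print("tilted:      SM =", sm(logf, tilted), " NBD =", nbd(tilted))
print("almost flat: SM =", sm(logf, almost_flat), " NBD =", nbd(almost_flat))
```

The tilted curve gets SM = 1 yet pays an NBD penalty simply for sloping, while the near-flat curve collects a perfect score on both - exactly the degenerate "target" the model would reward.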
Third, another problem is that, in a typical non-EQ'd speaker, deviations in PIR and ON are somewhat correlated with each other. And not just because PIR includes ON in its spatial average (I agree that's negligible), but also because many defects in a speaker's frequency response create similar deviations in both PIR and ON (resonances are the textbook example). And remember: we are dealing with a statistical model that only cares about correlation, not causation. This means that when the model gives weight to PIR, it implicitly gives weight to ON at the same time, because PIR is not independent from ON. This once again makes the statement "the model gives more weight to PIR than ON, therefore PIR is more important than ON" highly suspect. Because PIR is influenced by ON, it could very well be the case that the total influence of ON in the model is higher than that of PIR when the model is applied to a typical speaker. By the way, this assumption that deviations in ON are correlated with deviations in PIR tends to break down when EQ is applied, because EQ can easily remove such "omnidirectional" deviations. This likely means the behaviour of an EQ'd speaker is not "typical" behaviour as far as the model is concerned, which brings me to my next point…
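The correlation-vs-causation trap is easy to reproduce on synthetic data (again, purely a toy: the variable names are hypothetical and the numbers have nothing to do with Olive's dataset). Suppose preference is driven only by on-axis deviation, but PIR deviation inherits most of those same defects; a regression on PIR alone will still "credit" PIR with nearly all the predictive power.

```python
import numpy as np

# Toy illustration of correlation vs causation (synthetic, not Olive's data):
# preference is driven *only* by on-axis deviation, but PIR deviation is
# strongly correlated with it (e.g. resonances show up in both curves).
rng = np.random.default_rng(1)
on_dev = rng.normal(0, 1, 1000)              # hypothetical ON deviation
pir_dev = on_dev + rng.normal(0, 0.3, 1000)  # PIR inherits most of ON's defects
pref = -1.0 * on_dev                         # ground truth: only ON matters

# Regress preference on PIR alone: the model happily assigns PIR the weight.
slope = np.polyfit(pir_dev, pref, 1)[0]
r2 = np.corrcoef(pir_dev, pref)[0, 1] ** 2

print(f"weight assigned to PIR: {slope:.2f}, r^2 = {r2:.2f}")
```

Despite PIR having zero causal role here, the fit assigns it a weight close to the true ON weight with high r², so reading "importance" off a coefficient is exactly as fragile as the paragraph above argues.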
Fourth, again, the model is calibrated on typical speakers. Most speakers (especially passive speakers) are not factory EQ'd, and that was even more true back when Olive generated his model. Non-EQ'd speaker design involves different tradeoffs than EQ'd speaker design. For example, if one doesn't use EQ, a speaker designer might be tempted to leave in a frequency response dip on the grounds that it would be hard to fix and is not very audible anyway. But Olive's NBD metric doesn't care about peaks vs. dips. Thus, the predictive power of NBD_ON is weakened, and PCA reduces its weight in the model. (The case of NBD_PIR is more complicated because of its interaction with SM_PIR.) In contrast, if we're talking about EQ'd speakers, then there's no reason not to fill in dips. One could easily come up with other examples of how EQ'd speaker design might differ from non-EQ'd. Another example off the top of my head is that frequency response deviations that affect all angles equally become less important because they can be EQ'd away, which means the speaker designer can make tradeoffs such as EQ'ing a resonance instead of trying to physically remove it, if that makes it possible to, say, achieve better directivity control. My point is: the model was likely not designed to assess speakers that are designed that way. They are not "typical speakers" as far as the model is concerned. Therefore, the predictive power of the Olive model is likely weakened on an EQ'd speaker, limiting its usefulness as a target for EQ.
Now, are all these problems real obstacles to using the Olive formula as an EQ target? I have no idea. It's impossible to know without doing more experiments (or having access to Olive's raw data). All I'm saying is, I don't think naive, simplistic, jump-to-conclusions statements such as "the Olive model was designed to correlate with preference on a typical sample of loudspeakers, and as such it presents a well-defined design target" are useful. I think we should reserve judgment and not overconfidently extrapolate from studies that were never designed to be interpreted in that way. And again, I'm not fundamentally opposed to the idea of using the score as an EQ target; as I said before, I think that's better than doing nothing, and it might actually work. But summarily dismissing other approaches (such as targeting a flat ON/LW) based on a perception that the Olive model is always superior to everything else in every possible use case (such as EQ targeting) is unwarranted and overconfident, IMHO.