As you prefer to post apodictic claims instead of explanations, I see very little evidence for that.
So, irrelevant to this discussion
Actually very relevant, as it describes the subjective impression people get when doing such experiments.
KEF Q11 Meta. And many active studio monitors.
Could you name any of the active ones, please? I am not familiar with the KEF, but I guess there is no point in naming a speaker as an example of imaging if it does not meet basic standards of on-axis and off-axis linearity alike.
Simple: by listening to a single loudspeaker in a room, taking reflections (subjectively) into account.
That is not the same, and you probably know it. Reflections in a room (particularly the early ones affecting localization) that originate from a single speaker reproducing monaural content always support, from the perspective of our ears, the localization of the real source, the mono loudspeaker, even if they dominate.
That is not comparable to their influence in a stereo setup, where they would also help the ears reveal the real localizable sources, the two loudspeakers, while the intended localization is the phantom source somewhere in between the two real sources. The result in terms of localization is not predictable or transferable from the monaural experiment.
If the measurements are similar, then both speaker models are equal in localization stability.
I do not disagree, but that is a theoretical scenario. If you measure interchannel differences in a real room, taking the reflections into account, you will always end up with differences far greater than your proclaimed threshold. The amount might vary from speaker to speaker, as completely identical driver layout and directivity are quite rare (even in a series like the Genelec 83xx, which was designed to provide exactly that).
Knowing that there are differences exceeding your threshold would not help. I do not see how you would be able to predict which speaker offers which imaging, localization and ambience characteristics from these measurements. If you have a model to predict exactly that, it would be helpful to know how it works.
To evaluate whether the image is more distant or near (or flat), you have to listen to/evaluate/review a single loudspeaker.
No, you cannot. In a monaural scenario, the reverb pattern contained in the recordings being reproduced is of a completely different kind, as it is also monaural (unless we are talking about a completely dry recording, which would be useless for judging distance or depth-of-field). In a two-channel scenario, a good portion of the reverb pattern relies on phantom source principles, just like the direct sound affecting the localization. How these play together with the reflections in the listening room is key to how we perceive depth-of-field and distance. You cannot judge that in mono at all.
The other aspect you seemingly ignore is localization stability. In a mono setup, you deal solely with real sound source localization. Stereophony relies on phantom source localization, with little interaural tolerance being one of the factors, but by far not the only one. The results of an experiment done according to one of these principles are not transferable to the other.
To double-check this, I recommend listening to speakers with an MTM arrangement, also dubbed D'Appolito or virtual point source. Preferably with bigger midrange drivers, a greater distance between them and a higher crossover frequency, preferably in a near-field environment. Something looking like this:
In mono, these will present, if well-designed, an almost perfectly stable localization, as all localizable frequencies seemingly originate from one point, the tweeter axis. The way we perceive localizable content from the midranges can be compared to phantom source localization (although it is not exactly that). In contrast, when listened to in a stereo setup, we have to deal with six real sources that are supposed to form one stable phantom localization. This is prone to fail, particularly under near-field conditions or when direct reflections (from sidewalls, a mixing console, a desk or the like) reveal the direction of the real sound sources.
One might say this is an extreme example, but something like that is at play, to a lesser degree, with basically most non-point-source speakers.
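To put rough numbers on why bigger midranges, wider spacing, a higher crossover frequency and a shorter listening distance make the MTM case more extreme, here is a small Python sketch. It only does basic geometry: how far apart the two midranges are in wavelengths at the crossover frequency, and which angle they span as seen from the listening position. All driver spacings, crossover frequencies and distances in it are made-up illustration values, not measurements of any real speaker, and it is not a model that predicts localization.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at about 20 degrees C


def wavelength(freq_hz):
    """Wavelength in metres for a given frequency."""
    return SPEED_OF_SOUND / freq_hz


def spacing_in_wavelengths(spacing_m, crossover_hz):
    """Centre-to-centre midrange spacing expressed in wavelengths
    at the crossover frequency."""
    return spacing_m / wavelength(crossover_hz)


def subtended_angle_deg(spacing_m, distance_m):
    """Angle (degrees) between the two midranges as seen from a
    listener on the tweeter axis at the given distance."""
    return math.degrees(2.0 * math.atan((spacing_m / 2.0) / distance_m))


# Hypothetical examples: (label, midrange spacing in m,
# crossover frequency in Hz, listening distance in m)
examples = [
    ("compact MTM, mid-field", 0.20, 2000.0, 3.0),
    ("large MTM, near-field", 0.40, 3000.0, 1.0),
]

for label, spacing, crossover, distance in examples:
    ratio = spacing_in_wavelengths(spacing, crossover)
    angle = subtended_angle_deg(spacing, distance)
    print(f"{label}: spacing = {ratio:.2f} wavelengths at crossover, "
          f"midranges span {angle:.1f} degrees at the listener")
```

The larger both numbers get, the less the two midranges can be expected to behave like one source on the tweeter axis. How that translates into audible localization problems is not something this sketch can tell.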