Good question. The 'why' is quite prominently missing here; it is the missing piece of the puzzle.
Or it is addressed with speculation only; no questions are even raised. There's no prospect of investigating the case, which is a bit of a pity. Agreed, we needed the standard, as a standard, and the world was shocked when it described the ideal speaker from an engineering point of view: flat, controlled, and good bass, "because it is haard".
I can tell you, I'm lost within seconds when equalizing my headphones with pink noise without a linear reference. Gross deviations, OK, but the finer details, no way.
This raises the fundamental question of where the test panel gets its reference from: what is good, true, whatever you call it. Is it from memory? Sure, because you cannot reasonably listen to two speakers at the very same time. But what was actually put into memory, and which criteria made it in? A human is not a tape recorder! The impression has to be understood to become a memory item. Understanding is an abstraction into criteria that are measurable (in the sense of the senses): more or less of this, and more or less of that.
Do we know the criteria?
Do people listen differently when evaluating speakers (in mono) versus listening for the fun of it?
If so, what is the reference, the 'ideal', used when comparing memory items against current impressions?
Harman has not asked these questions. They are engineers targeting a market, not participants in a scientific program; fair enough.
So, since people listening in the intended stereo mode forget about all the virtues that make a speaker good in mono, what's the point?
People buy speakers based on the mono 'speaker evaluation' listening mode. Then they use them in the stereo 'fun' listening mode. So? (Except for those who sport 'critical listening'.)