3. Preservation of time coherence is beneficial. Within my limited range of experience, some crossover topologies seem to image better than others even when the differences in their measured frequency responses are minor and inconclusive.
4. Room interaction matters a great deal. The first-arrival sound should be followed by a time gap in which minimal reflections arrive, and that gap should in turn be followed by a fairly generous amount of spectrally-correct reflections. These later-arriving reflections ideally arrive from many directions, and should be neither too strong nor too weak, decaying neither too fast nor too slow. The time gap can be a result of room treatments, room geometry, and/or speaker radiation pattern geometry, and of course the set-up of the speakers within the room should enable the reflection-free time gap.
Imo this package of room-interaction characteristics results in two benefits. First, minimizing the earliest reflections keeps the sound image localization cues on the recording from being smeared, which improves the "physical, tangible impression of the presence of the voice/instrument". (Simultaneously, the "small room signature" cues inherent to the playback room are somewhat disrupted.) Second, providing lots of spectrally-correct late reflections presents the ambience cues within the recording (such as the reverberation tails) effectively to the ears, potentially dominating over the playback room's signature and enabling a "you are there" perspective.
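
For anyone who wants to put rough numbers on the time gap described above, here is a minimal sketch of the arithmetic for a single side-wall reflection. It assumes a simple 2-D floor plan, a flat wall parallel to the y-axis, and a speed of sound of 343 m/s; the function name and the example positions are made up for illustration and are not from any particular tool. It only estimates when one reflection arrives, not how strong it is or whether it is spectrally correct.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: air at roughly room temperature


def reflection_delay_ms(speaker, listener, wall_x):
    """Delay in ms of one side-wall reflection relative to the direct sound.

    speaker, listener: (x, y) positions in metres on a floor plan.
    wall_x: x-coordinate of a flat wall running parallel to the y-axis.
    (Hypothetical helper, for illustration only.)
    """
    # Mirror the speaker across the wall; the reflected path length equals
    # the straight-line distance from this mirror image to the listener.
    image = (2.0 * wall_x - speaker[0], speaker[1])
    direct = math.dist(speaker, listener)
    reflected = math.dist(image, listener)
    return (reflected - direct) / SPEED_OF_SOUND_M_S * 1000.0


# Made-up layout: wall at x = 0, listener 3 m in front of the speaker plane.
near_wall = reflection_delay_ms((1.0, 0.0), (2.0, 3.0), wall_x=0.0)
farther_out = reflection_delay_ms((1.5, 0.0), (2.0, 3.0), wall_x=0.0)
print(f"speaker 1.0 m from wall: {near_wall:.2f} ms")
print(f"speaker 1.5 m from wall: {farther_out:.2f} ms")
```

In this made-up layout, moving the speaker farther from the side wall lengthens the delay of that wall's reflection, which is one way set-up can contribute to the time gap. Radiation pattern and room treatment affect the level of the reflection instead, which this sketch does not model.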