Are you aware that the frequency response (magnitude + phase) shows the exact same information as the corresponding impulse (or step) response, just presented in a different way? This is one of the crucial features of the Fourier transform (and its inverse), which is what is used to create these graphs.
It also means that the 'time-related' performance of a loudspeaker can be fully predicted from its frequency response (as long as both magnitude and phase are measured - which is normally the case). The two domains (time and frequency) are fundamentally connected.
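To make the time/frequency equivalence concrete, here is a minimal sketch in plain Python. The impulse response is a made-up decaying oscillation (not a real loudspeaker measurement), and the helper names `dft` and `idft` are just mine for illustration. It converts the impulse response to magnitude and phase, rebuilds it from those two curves alone, and then shows what is lost if the phase is thrown away.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(N^2), fine for a short demo)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

# Assumption: a synthetic impulse response, chosen only for illustration.
impulse = [math.exp(-t / 8.0) * math.cos(0.6 * t) for t in range(64)]

# Time domain -> frequency domain, split into magnitude and phase.
spectrum = dft(impulse)
magnitude = [abs(c) for c in spectrum]
phase = [cmath.phase(c) for c in spectrum]

# Rebuild the impulse response from magnitude and phase only.
rebuilt_spectrum = [cmath.rect(m, p) for m, p in zip(magnitude, phase)]
recovered = [c.real for c in idft(rebuilt_spectrum)]
max_error = max(abs(a - b) for a, b in zip(impulse, recovered))

# Same magnitude but with the phase discarded: a different impulse response.
zero_phase = [c.real for c in idft([complex(m, 0.0) for m in magnitude])]
phase_loss_error = max(abs(a - b) for a, b in zip(impulse, zero_phase))

print(f"round-trip error with magnitude + phase: {max_error:.2e}")
print(f"error when phase is discarded:           {phase_loss_error:.2e}")
```

The round-trip error should be down at floating-point rounding level, while the magnitude-only version differs clearly - which is why the phase has to be measured along with the magnitude for the two views to be interchangeable.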
I wrote about this before, with some examples - perhaps you will find it interesting; especially this part:
As you can see, the "steepness" of the step response (which may be what you call "transient response") depends mainly on how high in frequency the system can play.
Since humans can't hear much above 20kHz (and many much less than that) we don't need the 'transients' to be any 'sharper' than what you can achieve with a system low-passed at around 20kHz.
This is because our hearing itself is already low-passed close to 20kHz.
Any reproduced sound higher than the upper limit of your hearing will not make the 'transients' sound any more 'correct' to you - even if it makes the measured response look nicer!
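The link between bandwidth and step 'steepness' can be sketched numerically. This is a simplified illustration only: I'm assuming an ideal first-order low-pass, whereas a real loudspeaker is a more complex band-pass system, and the function name `rise_time_10_90` is my own. For this simple filter the step response is 1 - exp(-2*pi*fc*t), so the 10%-90% rise time works out to roughly 0.35/fc.

```python
import math

def rise_time_10_90(cutoff_hz, dt=1e-7):
    """10%-90% rise time of a first-order low-pass step response, found numerically.

    Assumption: ideal first-order low-pass, step response y(t) = 1 - exp(-t / tau).
    """
    tau = 1.0 / (2.0 * math.pi * cutoff_hz)
    t = 0.0
    t10 = None
    while True:
        y = 1.0 - math.exp(-t / tau)
        if t10 is None and y >= 0.1:
            t10 = t  # first time the response crosses 10%
        if y >= 0.9:
            return t - t10  # time taken to go from 10% to 90%
        t += dt

for fc in (5_000, 10_000, 20_000, 40_000):
    rt_us = rise_time_10_90(fc) * 1e6
    print(f"{fc:>6} Hz cutoff -> 10-90% rise time of about {rt_us:.1f} us")
```

Doubling the cutoff halves the rise time, so the step keeps getting 'steeper' on paper as bandwidth goes ultrasonic - but whatever is added above the limit of your hearing is filtered out again by the ear.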
Note also that we can measure much higher frequencies than we can hear - i.e. there's no problem measuring the step response of a system with an ultrasonic response. But we don't usually focus on that because normally there's no audible benefit.
For example, my measurement microphone is calibrated on-axis up to 25kHz, but I hear only up to about 16-17kHz.
Next, you might argue that 'zero-phase' crossovers are critical for accurate sound, but various well-regarded researchers argue that this is not the case in practice (here's a link to the relevant AudioXpress article with quotes from experts in the field).
It might also be worth mentioning here that a single frequency/impulse/step response is not a full descriptor of tonality in a loudspeaker.
Since loudspeakers radiate sound non-homogeneously in 3D space, every point around them will have a (more or less) different measured response. This is why we need to look at the full spinorama when evaluating a loudspeaker, and not just the on-axis response. We're looking for both 'flat' on-axis response and 'good' (even) directivity.
But even the spinorama has limitations - it is designed only to describe tonality, so it says nothing about e.g. max SPL capability or non-linear distortion, which is why we have various additional measurements to test for those.
The suite of measurements published for current Kali Audio monitors indicates these are very well-designed, good-value loudspeakers, assuming they are set up correctly and not expected to play louder than they can handle. Better loudspeakers exist, sure, but at the price these seem to be pretty good!