Not fully.
But here's an attempt at a simple explanation...
The function of a crossover is to filter the signal into different frequency bands. Any time a signal is filtered, ringing is produced as a byproduct.
In the case of conventional analogue filters and digital IIR filters (i.e. minimum phase filters), both non-constant group delay (the delaying of some frequencies in relation to others) and ringing result. Lower frequencies arrive later than higher frequencies (non-constant group delay), and ringing occurs after the initial rise of the signal (post-ringing).
The magnitude of both phenomena increases with the steepness of the filter slope (steeper filter = more ringing and more group delay).
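If you want to see the post-ringing for yourself, here's a quick pure-Python toy sketch (my own illustration, not from any particular product). It implements a minimum phase 2nd-order IIR low-pass using the well-known RBJ "Audio EQ Cookbook" biquad formulas and feeds it a single impulse; the sample rate, cutoff and Q are arbitrary choices:

```python
# Minimum phase 2nd-order IIR low-pass (RBJ cookbook biquad) driven by an
# impulse. All of the ringing appears AFTER the impulse: post-ringing only.
import math

fs, f0, Q = 48000.0, 1000.0, 4.0            # sample rate, cutoff, resonance (arbitrary)
w0 = 2 * math.pi * f0 / fs
alpha = math.sin(w0) / (2 * Q)
cosw = math.cos(w0)
a0 = 1 + alpha
b = [(1 - cosw) / 2 / a0, (1 - cosw) / a0, (1 - cosw) / 2 / a0]
a = [1.0, -2 * cosw / a0, (1 - alpha) / a0]

x = [1.0] + [0.0] * 299                     # unit impulse
y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
for xn in x:                                # direct-form I difference equation
    yn = b[0]*xn + b[1]*x1 + b[2]*x2 - a[1]*y1 - a[2]*y2
    y.append(yn)
    x2, x1 = x1, xn
    y2, y1 = y1, yn

# y starts moving at the impulse (nothing before it, since the filter is
# causal) and then oscillates and decays: that oscillation is the post-ring.
```

Raising Q or cascading more sections (a steeper slope) makes the ringing last longer, which is the proportionality described above.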
However, non-constant group delay can be avoided by use of a linear phase FIR filter. With such a filter, all frequencies can be made to pass through the filter simultaneously, with no relative delay between different frequencies.
But there is a cost: for the FIR filter to avoid non-constant group delay, it must produce not only post-ringing, but also pre-ringing (ringing that occurs before the main body of the signal).
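The reason is easy to see in code: a linear phase FIR low-pass is just a symmetric windowed sinc, and because its impulse response is mirror-symmetric about its centre, any ringing after the peak necessarily has a mirror image before it. A minimal pure-Python sketch (my own example; tap count and cutoff are arbitrary):

```python
# Linear phase FIR low-pass: windowed sinc with a symmetric impulse response.
import math

N, fc = 101, 0.1                     # taps, cutoff as a fraction of the sample rate
mid = (N - 1) // 2
h = []
for n in range(N):
    k = n - mid
    # ideal low-pass (sinc), with the k == 0 limit handled separately
    s = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
    w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))   # Hamming window
    h.append(s * w)

# h[n] == h[N-1-n]: this symmetry is exactly what makes the phase linear,
# and it forces the sinc's ringing lobes to appear on BOTH sides of the
# main lobe -- pre-ringing as well as post-ringing.
```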
This graph shows examples of a filter that exhibits only post-ringing (minimum phase, blue) and a filter that exhibits both pre- and post-ringing (linear phase, red):
It's not shown in the above graph, but it can be inferred that the minimum phase filter will also delay low frequencies relative to high frequencies (non-constant group delay), whereas the linear phase filter will not: it allows all frequencies to pass through simultaneously.
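That constant-delay property can be checked numerically. Here's a small sketch of mine: for any symmetric FIR, the group delay -dφ/dω comes out as the same constant, (N-1)/2 samples, at every frequency (the 3-tap filter below is just an arbitrary example):

```python
# Numerical group delay of a symmetric (linear phase) FIR.
import cmath

h = [0.25, 0.5, 0.25]    # trivially symmetric 3-tap low-pass (arbitrary example)

def H(w):
    """DTFT of h at angular frequency w (radians/sample)."""
    return sum(c * cmath.exp(-1j * w * n) for n, c in enumerate(h))

def group_delay(w, dw=1e-6):
    # central-difference estimate of -dphi/domega; valid away from zeros of H
    return -(cmath.phase(H(w + dw)) - cmath.phase(H(w - dw))) / (2 * dw)

delays = [group_delay(w) for w in (0.1, 0.5, 1.0, 1.5)]
# every entry comes out at (N-1)/2 = 1.0 sample, i.e. constant group delay
```

So the linear phase filter delays everything equally; the price, as noted above, is that the delay is nonzero for all frequencies and the ringing is split across both sides of the transient.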
It is generally accepted that, above certain thresholds, non-constant group delay, pre-ringing and post-ringing become audible, although the question of where these thresholds lie remains contentious. Various studies have been undertaken, though arguably not enough.
Blauert and Laws investigated the audibility of group delay at various frequencies and concluded that the thresholds were in the range of 1-2 ms in the ear's most sensitive region (around 1000 Hz, IIRC). This corresponds to a phase shift of a bit more than 360°, or slightly more than that created by a conventional 4th order filter. Subsequent research suggests that these findings are roughly correct, but that the picture is more complicated: what matters is not the phase shift/group delay across the whole audio band, but rather that within an ERB (equivalent rectangular bandwidth, the successor to the "critical band" concept). In other words, the rate of change of group delay is arguably more important than its absolute magnitude.
@j_j is the expert on these matters.
Post-ringing is known to be largely masked by the signal as a result of forward masking, and is not a concern under normal circumstances (I'm sure exceptions can be concocted, of course).
The question of the audibility of pre-ringing is much more complex. For most people, there is a backward masking effect, such that a louder sound (in this case, the signal itself) will tend to mask a softer sound occurring up to roughly 20 ms prior to it. However, unlike other forms of masking, backward masking's duration and degree vary widely from subject to subject, with some subjects (particularly with training) exhibiting almost complete immunity to it.
Moreover, IIUC, even in cases where a backward masker renders an earlier sound inaudible, that earlier sound may nevertheless affect the perceived loudness of the later/louder sound. For pre-ringing, this implies that it may negatively affect the perceived loudness of transients in a signal. Unfortunately, however, this is where my understanding of the topic ends...