That is a bloody complicated question, to say the least. First, time coherence is a dual of phase. Remember that a time delay is exactly a phase shift of 2 pi f t radians, where f is frequency and t is the time delay. So a pure time delay is exactly a straight line in phase versus frequency, and the slope of that line (2 pi t) gives you the time delay.
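If it helps to see the delay/phase relationship as arithmetic, here is a minimal sketch. The function names are mine, made up for illustration, and it assumes an ideal pure delay with nothing else going on.

```python
import math

def delay_phase(freq_hz, delay_s):
    """Phase shift in radians that a pure delay produces at one frequency."""
    return 2 * math.pi * freq_hz * delay_s

def delay_from_phase_slope(f1_hz, phase1_rad, f2_hz, phase2_rad):
    """Recover the delay from the slope of the (unwrapped) phase line."""
    return (phase2_rad - phase1_rad) / (2 * math.pi * (f2_hz - f1_hz))

# Hypothetical example: a 1 ms delay, looked at 100 Hz and 1 kHz.
delay = 0.001
p_low = delay_phase(100.0, delay)
p_high = delay_phase(1000.0, delay)
recovered = delay_from_phase_slope(100.0, p_low, 1000.0, p_high)
```

The same delay gives ten times the phase shift at ten times the frequency, which is all "a line in phase" means.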

But does it matter? Sometimes. Phase shifts above about pi/6 or so inside of one ERB (or critical band) can be audible. (We're talking monaural here.) That's because they can change the firing time of the inner hair cells in the cochlea. (N.B. ERBs and critical bands are both measures of the cochlear filter bandwidth. ERBs are more accurate until you get down to very low frequencies, where there is some serious dispute, and a lot of confusion between frequency and time issues.)

Phase shift between two far-removed frequencies is not audible.
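To put rough numbers on "inside of one ERB" versus "far-removed", here is a small sketch. It assumes the commonly quoted Glasberg and Moore approximation for ERB width, which is my choice for illustration and is not stated anywhere above, and the "within one ERB" test (centered on the average of the two frequencies) is a crude stand-in, not a model of the cochlea.

```python
def erb_hz(center_hz):
    """Approximate ERB width in Hz (Glasberg & Moore style formula, assumed here)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)

def within_one_erb(f1_hz, f2_hz):
    """Crude check: do two frequencies fit inside one ERB centered between them?"""
    center = (f1_hz + f2_hz) / 2.0
    return abs(f2_hz - f1_hz) <= erb_hz(center)

# Hypothetical examples: close neighbours near 1 kHz versus far-removed tones.
close_pair = within_one_erb(960.0, 1040.0)
far_pair = within_one_erb(200.0, 2000.0)
```

By this estimate the band around 1 kHz is on the order of 130 Hz wide, so 960 Hz and 1040 Hz share a band while 200 Hz and 2 kHz plainly do not.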

Time delay, however, or phase inversion (aka how many crossovers work) CAN be audible if it creates a change in the signal envelope. This is where digital crossovers, among other things, excel. But this is a complicated issue, and I have to shy off being too precise here, sorry, at least for the time being.
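A toy illustration of how a polarity flip can change the envelope while leaving the magnitude spectrum alone. This is only a sketch under my own assumptions: three components close together (1000 Hz with neighbours 40 Hz either side, all arbitrary values), and the envelope taken as the magnitude of the complex sum. It is not a model of any real crossover.

```python
import cmath
import math

def envelope(components, carrier_hz, t):
    """Envelope at time t: magnitude of the complex sum of the components.
    components is a list of (freq_hz, amplitude, phase_rad) tuples."""
    total = sum(a * cmath.exp(1j * (2 * math.pi * (f - carrier_hz) * t + ph))
                for f, a, ph in components)
    return abs(total)

def envelope_range(components, carrier_hz, beat_hz, steps=1000):
    """Min and max of the envelope sampled over one beat period."""
    values = [envelope(components, carrier_hz, n / (steps * beat_hz))
              for n in range(steps)]
    return min(values), max(values)

FC, FM, M = 1000.0, 40.0, 0.5  # hypothetical carrier, spacing, modulation depth

# All three components in phase: ordinary amplitude modulation.
in_phase = [(FC - FM, M / 2, 0.0), (FC, 1.0, 0.0), (FC + FM, M / 2, 0.0)]
# Same amplitudes, but the top component is polarity-inverted (phase of pi).
top_inverted = [(FC - FM, M / 2, 0.0), (FC, 1.0, 0.0), (FC + FM, M / 2, math.pi)]

range_in_phase = envelope_range(in_phase, FC, FM)
range_inverted = envelope_range(top_inverted, FC, FM)
```

With everything in phase the envelope swings between roughly 0.5 and 1.5. Flip the top component and it barely moves, staying between about 1.0 and 1.12, even though the amplitude of every component is unchanged.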

Between ears, about 5 microseconds is the shortest credible ITD (interaural time difference). This, however, depends enormously on frequency and the signal structure. At low frequencies, time delay (or phase shift) can be audible. Between 500 Hz and 2 kHz, give or take, it's much less audible. Above that, it depends heavily on the envelope of the signal. Again, I must shy off being too, too specific for the time being.

Neither of these directly answers questions about step response. Remember, first, that step response and impulse response are mathematically related. So you can talk about either meaningfully. And the answer to 'does that matter' is more complicated than I feel like writing about at 3AM. Sorry.
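For the record, the step/impulse relationship in discrete time is just a running sum one way and a first difference the other. A minimal sketch, with made-up sample values and function names of my own:

```python
from itertools import accumulate

def step_from_impulse(impulse):
    """Step response is the running sum of the impulse response."""
    return list(accumulate(impulse))

def impulse_from_step(step):
    """Impulse response is the first difference of the step response."""
    if not step:
        return []
    return [step[0]] + [b - a for a, b in zip(step, step[1:])]

# Hypothetical impulse response, chosen only for illustration.
h = [0.5, 0.25, 0.125, 0.125]
s = step_from_impulse(h)
h_back = impulse_from_step(s)
```

Going one way and then back returns the original samples, which is why either response carries the same information.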