I have already addressed the relationship between ITD/ILD and transient steepness: it changes the flow of information that reaches our brain to process spatial information.

Your reasoning is correct if we analyze the signal in its entirety. But our ear doesn't know the future time course of the signal: it relies on what it has detected up to that moment to decode it. Now, while for the analysis of timbre (an unscientific term, I know) harmonic analysis reveals a lot of what we can perceive, for the analysis of spatial information what counts are the instants at which the variations occur, which classical Fourier analysis does not highlight. In the more sophisticated models wavelets are used (it is no coincidence that the filters used to simulate the cochlea's behavior closely resemble them), but here things get quite complicated.

Wavelets are irrelevant here, unless they are wavelets designed to mimic the ERB structure of the ear, including the eardrum and middle ear. Since wavelet transforms aren't lossy, that's not going to happen. (Making a multiresolution system does not necessarily imply wavelets, please; there are as many ways to do it as there are needs, if not more.)

Short-term signal analysis does not require wavelets; it can just as easily be done using a Fourier basis. One could also use a cosine transform basis, a sine transform basis, or many others; however, the complex exponential is very handy in relating the analysis to the behavior of the ear. The statement about 'a signal in its entirety' simply ignores the actualities of what has long since been done. One can window even the infinite Fourier transform, or use a window known to match the ear's overall window. Better still, one can just use a good impulse response model of a given cochlear filter, or even a complex model, to capture the energy as a function of time in a given bandwidth. For that, please see "gammatone" filters, as well as the more complex stuff that's available in the periphery of the literature (which is more useful for analyzing masking thresholds, for instance).
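For anyone who wants to see what such an impulse response model looks like, here is a minimal sketch of a complex gammatone. The specifics are my assumptions, not something stated above: the Glasberg and Moore ERB approximation, order 4, the conventional 1.019 bandwidth factor, and the hypothetical helper names `erb_hz`, `gammatone_ir` and `effective_duration_s`. It is an illustration, not a calibrated cochlear model.

```python
import numpy as np

def erb_hz(fc_hz):
    """Glasberg & Moore ERB approximation in Hz (assumed here)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone_ir(fc_hz, fs_hz, duration_s=0.025, order=4):
    """Complex gammatone impulse response, peak magnitude normalised to 1."""
    t = np.arange(int(round(duration_s * fs_hz))) / fs_hz
    b = 1.019 * erb_hz(fc_hz)  # conventional bandwidth factor (assumption)
    envelope = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t)
    ir = envelope * np.exp(2j * np.pi * fc_hz * t)
    return t, ir / np.max(np.abs(ir))

def effective_duration_s(t, ir, floor_db=-40.0):
    """Time span over which |ir| stays above floor_db relative to its peak."""
    above = np.nonzero(np.abs(ir) >= 10.0 ** (floor_db / 20.0))[0]
    return t[above[-1]] - t[above[0]]

fs = 48000.0
for fc in (500.0, 4000.0):
    t, ir = gammatone_ir(fc, fs)
    print(f"fc={fc:.0f} Hz  ERB={erb_hz(fc):.1f} Hz  "
          f"duration above -40 dB = {1e3 * effective_duration_s(t, ir):.2f} ms")
```

The magnitude of the complex response is the energy-versus-time envelope for that band, and the wider the ERB, the shorter the response gets.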

The widest-bandwidth ERB is under 5 kHz; I'd argue for 2.4 kHz. That bandwidth (note, this is not 'maximum frequency' or any such nonsense, but bandwidth) determines the width of the impulse response, REGARDLESS of minimum phase, constant delay, or whatever kind of impulse response you care to mention. Fourier analysis exists not only over an 'infinite time' range but also as the time-limited Fourier transform and as the DFT, all of which are informative. What's more, by using the Fourier transform, one can even develop the signal envelope as a function of time by simply making the analytic signal (Hilbert signal) and calculating the signal envelope from that. This works in a given frequency range or not, and with a particular filter shape or not, take your choice; the duality theorem still works fine. David Hilbert was a very smart man.
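A minimal sketch of the analytic-signal route, done with nothing but the DFT. The test signal (a 3 kHz carrier with 50 Hz amplitude modulation) and the helper name `analytic_signal` are my own illustrative choices; the frequencies are picked so that both fit a whole number of periods in the block, which keeps the example free of edge effects.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the DFT: keep DC (and Nyquist), double the
    positive frequencies, zero the negative ones."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)

fs = 48000.0
t = np.arange(4800) / fs                        # 0.1 s block
am = 1.0 + 0.5 * np.cos(2.0 * np.pi * 50.0 * t)  # slow envelope
x = am * np.cos(2.0 * np.pi * 3000.0 * t)        # modulated carrier

envelope = np.abs(analytic_signal(x))
print("max envelope error:", np.max(np.abs(envelope - am)))
```

To get the envelope in one band only, multiply `gain` by whatever filter shape you like before the inverse transform.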

No, wavelets are not very similar to cochlear filters, either. It would be convenient if that were the case, but it isn't, and they aren't. A proper wavelet requires a particular set of conditions, and a wavelet transform is not a lossy transform. The EAR is extravagantly lossy in very unusual ways.

As to "changes the flow of information", no, you haven't shown that, or even proposed a proper method. Furthermore, the nonlinearity involved is that of firing the inner hair cell. Yes, a bad filter can cause what's effectively a pre-echo and affect the partial loudness stream, no doubt, as has been demonstrated many times to matter down to sub-one-MILLIsecond range. The first firing of an inner hair cell has almost exactly (surprise!) the time detection accuracy one would expect from elementary mathematics, too (gets you down into the 5 microsecond range using reasonable noise considerations for the inner hair cell). Using a very generous bandwidth for the widest cochlear filter and the highest reasonable SNR one gets to about 1 microsecond. None of this is either unknown or particularly telling.

The ***ONLY*** issue that remains that might be a plausible mechanism is interaction of a steep filter with the cochlear filter, causing pre-firing of the inner hair cell. This is why I would prefer a sampling rate higher than 44.1 kHz, BUT THERE IS NO EVIDENCE THIS IS ACTUALLY A PROBLEM; it remains, as it has been for 30 years now, a hypothetical mechanism. Remember that it's not the highest frequency a low pass filter allows that controls its impulse response length, but rather the steepness of its transition band and the required rejection in the stop band.
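To put rough numbers on that last point, here is a sketch using Kaiser's well-known empirical estimate for the length of a windowed FIR low-pass. The 100 dB rejection, the 20 kHz passband edge and the two sampling rates are assumptions I picked for illustration, and `kaiser_fir_length` is a hypothetical helper name; real converters use other filter designs, so treat this as order-of-magnitude only.

```python
import math

def kaiser_fir_length(stopband_db, transition_hz, fs_hz):
    """Kaiser's estimate of FIR length (taps) for a given stopband
    rejection in dB and transition width in Hz."""
    delta_omega = 2.0 * math.pi * transition_hz / fs_hz  # rad/sample
    return math.ceil((stopband_db - 7.95) / (2.285 * delta_omega)) + 1

rejection_db = 100.0       # assumed stopband rejection
passband_edge_hz = 20000.0  # same passband edge in both cases

for fs in (44100.0, 96000.0):
    transition = fs / 2.0 - passband_edge_hz  # edge up to Nyquist
    taps = kaiser_fir_length(rejection_db, transition, fs)
    print(f"fs={fs:.0f} Hz  transition={transition:.0f} Hz  "
          f"taps={taps}  length={1e3 * taps / fs:.2f} ms")
```

Both filters pass the same 20 kHz; only the room available for the transition band changes, and the impulse response length changes with it.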

PCM easily reaches the ITD limits of the ear. ILD is not an issue in this discussion, and neither are distortions. The only question is signal detection.
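A quick sketch of why the sample period is not the timing resolution of PCM. Everything here is an illustrative assumption on my part: a noise-like test signal band-limited to 16 kHz, plain 16-bit rounding without dither, a 5 microsecond interchannel delay, and a simple cross-spectrum phase-slope estimator named `estimate_delay_s`. It says nothing about how the ear does it, only about what the sampled data carries.

```python
import numpy as np

fs = 44100.0        # sample period is about 22.7 microseconds
n = 4096
f_max = 16000.0
true_delay_s = 5e-6  # far below one sample period

rng = np.random.default_rng(0)
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
spectrum = rng.standard_normal(freqs.size) + 1j * rng.standard_normal(freqs.size)
spectrum[(freqs == 0.0) | (freqs > f_max)] = 0.0  # band-limit, remove DC

# Same signal in both channels, right channel delayed by a linear phase shift.
left = np.fft.irfft(spectrum, n)
right = np.fft.irfft(spectrum * np.exp(-2j * np.pi * freqs * true_delay_s), n)

# Scale to half of full scale and round to 16-bit steps (no dither).
scale = 0.5 / max(np.max(np.abs(left)), np.max(np.abs(right)))
left = np.round(left * scale * 32767.0) / 32767.0
right = np.round(right * scale * 32767.0) / 32767.0

def estimate_delay_s(a, b, fs_hz, f_max_hz):
    """Delay of b relative to a from the slope of the cross-spectrum phase,
    as a magnitude-weighted least-squares fit through the origin."""
    f = np.fft.rfftfreq(len(a), d=1.0 / fs_hz)
    cross = np.fft.rfft(b) * np.conj(np.fft.rfft(a))  # phase = -2*pi*f*delay
    band = (f > 0.0) & (f <= f_max_hz)
    w, phase = np.abs(cross[band]), np.angle(cross[band])
    return -np.sum(w * f[band] * phase) / (2.0 * np.pi * np.sum(w * f[band] ** 2))

estimated_delay_s = estimate_delay_s(left, right, fs, f_max)
print(f"true {1e6 * true_delay_s:.3f} us, estimated {1e6 * estimated_delay_s:.3f} us")
```

The delay is encoded in the sample values, not in the sample positions, which is why a shift much smaller than one sample period survives sampling and quantization.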