To measure ITD for impulsive sounds, the envelope is used. To measure for periodic sounds, phase is used.
I understand that this might come across as semantics, but please bear with me. I want to shed some light on the topic and make it intuitive.
The term "impulsive sounds" may refer to a transient. But let's first establish a crucial definition. If we look at any change in amplitude from zero to some kind of signal, it is a change to the steady state (which was zero, but has become nonzero). This change is called a transient. When a sine wave starts or stops, it is a transient. But if we play back a sine wave burst through a subwoofer, these start/stop events are mostly filtered out.
So if we take a look at Fourier's take, which is generally accepted as fact, a pure sine cannot have a start and a stop; it has to last forever. Even a slight change in gain would mean there is a very slight transient involved. This is what can be described as a pure single-frequency event. In a Fourier transform (which converts a given signal between the time domain and the frequency domain) we can see that if a signal is infinitely narrow in the frequency domain, it is infinitely wide in the time domain.
On the other hand, if we make a perfect transient, it will be infinitely sharp and short lasting. This would be infinitely narrow in the time domain, and infinitely wide in the frequency domain, so in this case, we will need to have far more frequencies than the audio band.
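The narrow-in-one-domain, wide-in-the-other relationship can be checked numerically. Below is a minimal Python sketch, not a rigorous measurement: the window length, the two pulse widths, the Gaussian pulse shape and the half-power criterion are all assumptions chosen for illustration.

```python
import cmath
import math

N = 256  # analysis window length in samples (assumed)

def dft_magnitudes(x):
    """Naive DFT magnitudes for bins 0..N/2 (fine for short signals)."""
    n = len(x)
    return [abs(sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n)))
            for m in range(n // 2 + 1)]

def gaussian_pulse(n, sigma):
    """Gaussian bump centred in an n-sample window; sigma is in samples."""
    centre = n / 2
    return [math.exp(-0.5 * ((k - centre) / sigma) ** 2) for k in range(n)]

def half_power_bins(mags):
    """Count the bins whose magnitude is at least 1/sqrt(2) of the peak."""
    peak = max(mags)
    return sum(1 for m in mags if m >= peak / math.sqrt(2))

short_pulse = gaussian_pulse(N, sigma=2)   # sharp in time
long_pulse = gaussian_pulse(N, sigma=20)   # spread out in time

bw_short = half_power_bins(dft_magnitudes(short_pulse))
bw_long = half_power_bins(dft_magnitudes(long_pulse))
print("short pulse:", bw_short, "bins, long pulse:", bw_long, "bins")
```

The pulse that is ten times shorter in time occupies several times as many frequency bins, which is the same trade-off taken to its extreme by the perfect transient.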
But there is another property to this that is quite interesting. If we look at a sine wave being introduced gradually, it will be quite long lasting in the time domain, and it will be almost as narrow as a single frequency in the frequency domain, but not 100% pure. We can add that we also turn down the gain again afterwards, ending up with a smooth rise and fall to the sine wave. If we look at this at the macro level, the outline is pretty much the same as the gain curve we applied. This is what is called the signal envelope. It does not follow the curve shape of the waveform, just the amount of all the frequencies occurring in the signal.
With a sine wave as pure as this, we may treat it as a sine wave for practical purposes, and talking about its signal envelope is also relevant. However, we have to remember that we have very little frequency content, so we do not have a leading edge. This means that if we look at the wave shape in the time domain, it is hard even to see where this signal really starts, and when it does, the SPL is probably way below the noise floor. This kind of signal envelope does not contain information that our ears can use as a timing reference, and therefore ITD does not apply.
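As a rough numerical illustration, the sketch below compares a tone burst that is switched on and off abruptly with one that is faded in and out. All the numbers (window length, burst length, tone frequency, the raised-cosine fade and the 8-bin tolerance) are assumptions made for the example.

```python
import cmath
import math

N = 512       # analysis window length in samples (assumed)
BURST = 256   # burst length in samples (assumed)
CYCLES = 40   # tone frequency in cycles per analysis window (assumed)
START = (N - BURST) // 2

def burst(gain):
    """Tone multiplied by a gain curve, with silence before and after."""
    x = [0.0] * N
    for i in range(BURST):
        k = START + i
        x[k] = gain(i) * math.cos(2 * math.pi * CYCLES * k / N)
    return x

def abrupt(i):
    """Gain snaps straight to full level (hard start and stop)."""
    return 1.0

def gradual(i):
    """Raised-cosine fade in and fade out."""
    return 0.5 - 0.5 * math.cos(2 * math.pi * i / BURST)

def leakage(x, keep=8):
    """Fraction of the energy lying more than `keep` bins away from the tone."""
    total = away = 0.0
    for m in range(N // 2 + 1):
        # Only the burst region is nonzero, so sum over that region only.
        coeff = sum(x[k] * cmath.exp(-2j * math.pi * m * k / N)
                    for k in range(START, START + BURST))
        energy = abs(coeff) ** 2
        total += energy
        if abs(m - CYCLES) > keep:
            away += energy
    return away / total

leak_abrupt = leakage(burst(abrupt))
leak_gradual = leakage(burst(gradual))
print("abrupt:", leak_abrupt, "gradual:", leak_gradual)
```

The faded burst keeps nearly all of its energy close to the tone frequency, while the hard-switched burst spreads a noticeably larger share across other frequencies. That spread is the frequency content of the start and stop transients.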
We can also look at a case where we have a near-perfect transient, setting mathematical perfection aside for a while and focusing only on the audio band. If we look at the frequency content, the duration of the transient tells us how low it goes in frequency. The relationship between the rise/fall time and the peak amplitude gives us the upper frequency. This means that if we have a DC step (a sudden and permanent change in steady-state voltage from, say, zero to 1 volt), the transient even includes 0Hz. We can look at a typical transient where all the frequencies are in phase, and we see that it represents a sudden rise and fall both when we look at the time domain waveform and when we look at the signal envelope. So here the signal envelope and the waveform are actually the same.
Then we can add some phase errors. If we introduce a couple of phase shifts, we will see that the waveform suddenly consists of a mess of positive and negative peaks. This means the transient waveform looks very different, but the signal envelope is unchanged, so now the two have become very different.
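A small Python sketch of this effect follows. It assumes "envelope" is read as above, meaning how much of each frequency is present (the magnitude spectrum). The number of components and the random phase values are hypothetical choices for the demo.

```python
import cmath
import math
import random

N = 256   # samples in the window (assumed)
K = 40    # number of equal-amplitude frequency components (assumed)

def synth(phases):
    """Sum K cosines; with all phases at zero they line up at the window centre."""
    return [sum(math.cos(2 * math.pi * m * (k - N // 2) / N + phases[m - 1])
                for m in range(1, K + 1))
            for k in range(N)]

def magnitude(x, m):
    """Magnitude of DFT bin m."""
    return abs(sum(x[k] * cmath.exp(-2j * math.pi * m * k / N) for k in range(N)))

random.seed(0)  # fixed seed so the run is repeatable
aligned = synth([0.0] * K)
scrambled = synth([random.uniform(-math.pi, math.pi) for _ in range(K)])

peak_aligned = max(abs(v) for v in aligned)
peak_scrambled = max(abs(v) for v in scrambled)

# The amount of each frequency is the same in both signals.
spectra_match = all(abs(magnitude(aligned, m) - magnitude(scrambled, m)) < 1e-6
                    for m in range(1, K + 1))
print(peak_aligned, peak_scrambled, spectra_match)
```

The in-phase version piles everything into one sharp peak, while the phase-shifted version smears the same frequency content into a lower, messier waveform.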
We can apply this to hearing by trying to compare these three cases:
1: A perfect transient.
2: A perfect transient with phase errors.
3: A filtered low frequency signal.
1: If we have perfect speakers and a perfect waveform, we get super precise localization from our ears' ability to measure ITD precisely. There is not much more to say about this, other than that talking about signal envelope or transient does not make a practical difference, and phase does not play a role since this is all just a leading edge.
2: This is where things start to get interesting. Nature shaped our ears to let us hear things that could help us survive, simply by killing those who could not locate the subtle sound of approaching danger. This noise would normally be a transient, and our ability to locate the angle and determine the distance was key to our survival. If we use this function of our ears but introduce some phase errors, we keep most of this ability but lose some. For determining distance from transients we rely mostly on frequencies above 1kHz, and if we have a phase error above that, we tend to lose the ability to determine distance precisely. In some cases our angular perception can also be disturbed, but then the phase errors have to be quite large. The beginning of the leading edge is relatively intact if the phase errors are kept below 1kHz. But as I explained above, we have now messed up the transient, and if this occurs at high frequencies we do get a huge difference between the signal envelope and the transient waveform. The signal processing in our brain is not a very conscious process. We get used to such errors over time and will mostly filter them out as redundant information, but we cannot by default pick and choose very well what to do with this added information. In other words, here ITD and IPD will work against each other. And that is really the key with IPD: it is not a natural event that we have evolved to deal with. We do hear it quite easily, but it just sounds odd, and it does not represent a way of locating sound sources, since there is no common situation where it would give us anywhere near as much information as the timing information in a transient.
3: So if we make a perfect transient, filter it to <80Hz, and listen to it, we take away the high frequencies. This means we remove the leading edge completely and replace it with a relatively slow fade-in as described above. We are now left with a signal envelope that looks like it could carry some kind of timing information, but the leading edge is completely gone, so the timing information from the transient is no longer there. It is impossible to say conclusively when the signal actually starts and stops, since the answer depends entirely on the definition and the goalposts can be moved. So what happens if we run to the IPD-rescue?
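To put a rough number on that slow fade-in, here is a minimal Python sketch. It is only a crude stand-in for a real subwoofer crossover: the sample rate, the use of four cascaded one-pole low-pass sections with an 80 Hz corner each, and the 10%-to-90% rise-time measure are all assumptions.

```python
import math

FS = 8000          # sample rate in Hz (assumed)
CORNER_HZ = 80.0   # corner frequency of each filter section
STAGES = 4         # number of cascaded one-pole sections (assumed)

def one_pole_lowpass(x, corner_hz, fs):
    """First-order recursive low-pass filter."""
    a = 1.0 - math.exp(-2.0 * math.pi * corner_hz / fs)
    y, state = [], 0.0
    for v in x:
        state += a * (v - state)
        y.append(state)
    return y

def rise_time_ms(x, fs, final=1.0):
    """Time between first reaching 10% and first reaching 90% of the final level."""
    t10 = next(i for i, v in enumerate(x) if v >= 0.1 * final)
    t90 = next(i for i, v in enumerate(x) if v >= 0.9 * final)
    return 1000.0 * (t90 - t10) / fs

# 10 ms of silence followed by a sudden step to 1.0 (100 ms in total).
step = [0.0] * 80 + [1.0] * 720

filtered = step
for _ in range(STAGES):
    filtered = one_pole_lowpass(filtered, CORNER_HZ, FS)

raw_rise = rise_time_ms(step, FS)
filtered_rise = rise_time_ms(filtered, FS)
print("raw:", raw_rise, "ms, filtered:", round(filtered_rise, 1), "ms")
```

The unfiltered step jumps within a single sample, while the filtered version takes several milliseconds to climb. Where it "starts" depends on which threshold you pick, which is the movable goalpost described above.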
This is probably the most interesting part of this discussion. But there is a rather problematic issue that stands in our way. A <80Hz subwoofer asked to reproduce a transient with a relatively good amount of low frequency content in it will typically excite two frequencies, or maybe three. We can imagine the phase being similarly distorted for all of these frequencies for one ear relative to the other. However, if we look at how we hear single-tone phase distortion, it is not very precise. The sound just changes from slightly hollow, maybe a bit to one side, to very hollow and all over the place, to distinctly mono again. This is because we do not have a great set of skills to sort out what to do with this information. We also need the phase difference to be significant before we can tell that something is going on with the phase to begin with. Now, if we introduce more tones, but with the same time difference, the phase difference will be different depending on which frequency we look at. The common argument here is that we can use ITD to resolve this, but ITD relies on a leading edge, which we do not have, because we have filtered it out.
I hope this helps to show why this confusion of terminology matters, and how it underpins what the difference between impulses, tones and signal envelopes means for actually understanding hearing.
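The frequency dependence of the phase difference is simple arithmetic: a fixed time offset corresponds to a phase offset of 360 degrees times frequency times time. A short Python sketch, where the 0.5 ms time difference and the three tone frequencies are hypothetical values picked for illustration:

```python
def ipd_degrees(freq_hz, itd_s):
    """Phase difference in degrees that a pure time offset produces at one frequency."""
    return 360.0 * freq_hz * itd_s

ITD_S = 0.0005                  # hypothetical time difference between the ears (0.5 ms)
TONES_HZ = (30.0, 50.0, 70.0)   # hypothetical tones excited below 80 Hz

ipd_by_tone = {f: ipd_degrees(f, ITD_S) for f in TONES_HZ}
for f, phase in ipd_by_tone.items():
    print(f"{f:.0f} Hz: {phase:.1f} degrees")
```

With these assumed numbers, each tone ends up with a different phase difference for the same time difference, and every one of them is only a small fraction of a full cycle.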