One has zero to do with the other. Timing precision is NOT a function of sample rate. This is a very common misunderstanding.
Please clarify. Are you saying that sampling at 1 Hz can provide the same timing precision as sampling at 1 MHz?
I'm sorry to say it, but the sampling step is not the limit for accurate timing; that is just plain wrong. As a proof, one can create a 44.1 kHz / 16-bit WAV file with two channels, where one channel contains a 20 kHz sine burst and the other channel contains the same signal shifted in time by 1 microsecond. Then play the file and look at the output with a scope. You will see both bursts, with a difference of 1 microsecond between the two channels.
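The experiment described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the 5 ms Hann-windowed burst shape, the half-scale amplitude, and the helper names (`burst`, `make_channels`, `write_wav`, `estimate_delay`) are choices made here, not part of the original post. Instead of a scope, the sketch recovers the shift from the samples themselves by comparing the phase of each channel at 20 kHz.

```python
import cmath
import math
import struct
import wave

FS = 44_100      # sample rate, Hz
F0 = 20_000.0    # burst frequency, Hz
DELAY = 1e-6     # shift of the right channel, s
BURST = 0.005    # burst length, s (assumed for this sketch)


def burst(t):
    """Hann-windowed sine burst evaluated at continuous time t (seconds)."""
    if t < 0.0 or t > BURST:
        return 0.0
    window = 0.5 * (1.0 - math.cos(2.0 * math.pi * t / BURST))
    return window * math.sin(2.0 * math.pi * F0 * t)


def make_channels(n_samples=441):
    """Sample the burst and a copy delayed by DELAY, rounded to 16-bit integers."""
    left, right = [], []
    for n in range(n_samples):
        t = n / FS
        left.append(round(0.5 * 32767 * burst(t)))
        right.append(round(0.5 * 32767 * burst(t - DELAY)))
    return left, right


def write_wav(target, left, right):
    """Write interleaved 16-bit stereo frames to a path or writable file object."""
    frames = b"".join(struct.pack("<hh", l, r) for l, r in zip(left, right))
    with wave.open(target, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)
        w.setframerate(FS)
        w.writeframes(frames)


def estimate_delay(left, right):
    """Delay of right relative to left, from the phase difference at F0."""
    def bin_at_f0(x):
        return sum(v * cmath.exp(-2j * math.pi * F0 * n / FS)
                   for n, v in enumerate(x))
    phase = cmath.phase(bin_at_f0(left) * bin_at_f0(right).conjugate())
    return phase / (2.0 * math.pi * F0)
```

Calling `write_wav("burst.wav", *make_channels())` produces the test file (the file name is arbitrary), and `estimate_delay(*make_channels())` should land close to 1 microsecond even though the sample spacing is about 22.7 microseconds.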
You are using a scope and a frozen graph of the signal amplitude. The hearing system doesn't have such a luxury. It can't "see" even the instantaneous value of a signal, let alone a high-resolution recording of the signal over time. Apologies in advance if you don't need a nano-lecture to follow. In any case, it may be educational for other readers.
Right in the inner hair cell, we have a rather crude rectifier chained with a non-linear integrator, whose output decays non-linearly over time. Once the result of the integration exceeds the threshold of one of the synapses connecting the inner hair cell body to an auditory nerve fiber, the fiber spikes. Then the fiber has to rest for about 2 ms, and won't spike again until its readiness is restored, even if the result of the integration grossly exceeds the threshold. There are several fibers with varying thresholds attached to an inner hair cell. They work together in a manner described by Volley Theory, successfully encoding signals with higher frequencies, and a wider intensity range, than a single fiber could.
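A toy model may help readers picture the chain described above: rectifier, decaying integrator, threshold, then a rest period. It is a deliberately crude sketch, not a physiological model. The integrator here uses a simple linear leak rather than the non-linear behaviour described in the post, and the simulation rate, the decay constant `TAU`, the threshold values, and the `first_ready` stagger between fibers are all assumptions made for illustration.

```python
import math

FS = 100_000        # simulation rate, Hz (assumed)
TAU = 0.0002        # integrator decay constant, s (assumed)
REFRACTORY = 0.002  # rest period after a spike, s (the "about 2 ms" above)


def tone(freq, duration, fs=FS):
    """Unit-amplitude sine tone as a list of samples."""
    return [math.sin(2.0 * math.pi * freq * n / fs)
            for n in range(round(duration * fs))]


def hair_cell_drive(signal, fs=FS, tau=TAU):
    """Half-wave rectify the input, then run it through a leaky integrator."""
    leak = math.exp(-1.0 / (fs * tau))
    level, drive = 0.0, []
    for x in signal:
        level = leak * level + (1.0 - leak) * max(x, 0.0)
        drive.append(level)
    return drive


def fiber_spikes(drive, threshold, first_ready=0.0, fs=FS, refractory=REFRACTORY):
    """Spike times of one fiber: fire when ready and the drive is at or above threshold."""
    rest = round(refractory * fs)      # rest period in samples
    ready_n = round(first_ready * fs)  # hypothetical stagger between fibers
    spikes = []
    for n, level in enumerate(drive):
        if n >= ready_n and level >= threshold:
            spikes.append(n / fs)
            ready_n = n + rest         # ignore the drive until rested
    return spikes
```

For example, with `drive = hair_cell_drive(tone(800.0, 0.05))`, the trains `fiber_spikes(drive, 0.3)` and `fiber_spikes(drive, 0.5, first_ready=0.00125)` are each capped by the 2 ms rest period, while the two trains pooled together carry more spikes than either fiber alone, which is the idea behind the volley picture.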
For signals with low frequency (definitely for the ones below 1 kHz) and constant or very slowly changing amplitude, the hearing system can figure out the pitch and intensity of the signal by correlating the timing of consecutive spikes in fibers with varying thresholds, which predominantly happen at certain phases of the sinusoid: this mechanism is called Phase Locking.
For signals over 5 kHz, this scheme falls apart, because the fibers can no longer track the sinusoid with sufficient time resolution. Instead, the hearing system has to determine the pitch using Place Coding (place refers to the most excited location on the basilar membrane). Between 1 kHz and 5 kHz, the hearing system employs both mechanisms, correlating their outputs.
So, for frequencies below 1 kHz, I accept your argument. The hearing system can indeed detect the phase shift and sort of "see" the phase difference between signals arriving at the right and left ears. For frequencies above 5 kHz, this doesn't work: both ears "tell" the brain that they detect a signal of that frequency, and report the intensity difference, yet this information is too crude to determine the direction to the sound source with high accuracy.
Evolutionarily, detecting the direction to a sound source that emits a quickly decaying burst of high frequency is very important: a common use case involves a predator or an enemy stepping on a dry tree branch. When this happens, the hearing system processes it as a transient rather than as a steady tone. As we know from Fourier theory, a transient transforms into a wide range of frequencies. Many auditory fibers spike at about the same time, and the brain gets a different kind of signal, with the left and right ears now reporting the times of the onsets of the signal. From this inter-aural time difference, which can be resolved with a precision of about 5 microseconds, the hearing system determines the direction to the sound source with significantly higher precision.
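To give a feel for what a 5 microsecond inter-aural time difference means as an angle, here is a back-of-the-envelope sketch. It assumes a far-field source and a straight-line path between the ears, so that the time difference equals (ear spacing / speed of sound) times the sine of the azimuth. The ear spacing of 0.18 m and the speed of sound of 343 m/s are assumed round numbers, not figures from this thread, and the function name is made up for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed (air at roughly room temperature)
EAR_SPACING = 0.18      # m, assumed distance between the ears


def itd_to_azimuth_deg(itd_s):
    """Azimuth (degrees off the median plane) for a given inter-aural time difference.

    Simplified model: itd = (EAR_SPACING / SPEED_OF_SOUND) * sin(azimuth).
    """
    s = itd_s * SPEED_OF_SOUND / EAR_SPACING
    s = max(-1.0, min(1.0, s))  # clamp so asin stays defined
    return math.degrees(math.asin(s))
```

Under these assumptions, `itd_to_azimuth_deg(5e-6)` comes out at roughly half a degree for a source near straight ahead.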
Now imagine that instead of hearing the sound directly, you have headphones on, fed by microphones attached to the headphones, through a chain of AD and DA conversions with a 44.1 kHz sampling rate. Depending on the initial amplitude and decay function of the burst, the direction to the sound source, and the specifics of the AD and DA conversions, you may (A) not hear the burst at all, (B) hear it only in the left ear, (C) hear it only in the right ear, (D) hear it in both ears with the same onset time, or (E) hear it in both ears with onset times quantized by the sampling rate.
The higher the sampling rate, up to the neurophysiological limit, the more probable it is that the outcome will be (E). So, using a sampling rate in excess of what the Nyquist Theorem would suggest helps with resolving the direction to the sound source, thus allowing a higher number of distinguishable sound sources to be placed into the angular extent of a mix. Once again, this doesn't matter much for pop music and typical audiophile benchmarking music fragments. It matters a lot for the enjoyment of music with finer spatial and temporal structure.
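For reference, the sample spacing at a few common rates can be set next to the roughly 5 microsecond figure quoted earlier. The snippet below only computes the time between consecutive samples; it says nothing about what timing a reconstruction filter can or cannot convey between samples, which is the point under dispute in this thread. The list of rates and the function name are assumptions chosen for the example.

```python
RATES_HZ = (44_100, 48_000, 96_000, 192_000)  # common rates, assumed for illustration


def sample_period_us(rate_hz):
    """Time between consecutive samples, in microseconds."""
    return 1e6 / rate_hz


def period_table(rates=RATES_HZ):
    """One formatted line per rate: rate in Hz and sample period in microseconds."""
    return [f"{rate:>7} Hz  {sample_period_us(rate):6.2f} us" for rate in rates]
```

Printing `"\n".join(period_table())` lists about 22.68 microseconds at 44.1 kHz down to about 5.21 microseconds at 192 kHz.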