He talks about how you don't need to have a sample at the start point of the sound - i.e. that the start times of sounds aren't quantized to the sample rate. I totally get that and his animation is great for explaining that.
However, his samples seem to move perfectly smoothly and aren't quantized to available bit-depth values at his sample positions.
Would this mean that the start time of a sound is quantized to some combination of bit depth and sample rate?
Of course there's only one meaningful follow-up, and that is "if that's true, does it matter?" I'm guessing the answer is "no, because you can't tell that fine-grained a change."
"The start time of sound..." is a dicey proposition to start with. Let's say I'm recording myself, digitally, playing piano and singing. There will be a gap of "nothing" (not really nothing, but probably a little hiss from the mic preamps, too low to hear). I could digitally edit that out, so it's maybe a few zeros, then the rising samples of my first note. But even that, if you zoom in closely enough, probably starts out below the noise floor of the mic and preamps. Or, if I'm using a gate, then the gate dictates where the rising edge is, but it will lag reality. Not trying to be annoying; it's just ambiguous exactly where a sound "starts" out of supposed silence. So, is it really that important that the digital recording "starts" where the sound does, considering that neither your ear nor a gate circuit can tell exactly?
Further, strictly speaking, any signal that is time-limited can't also be bandlimited, and since the sampling theorem requires that the signal be bandlimited to below half the sample rate, you can't perfectly sample anything with an arbitrary start time. (Yes, even a 1 kHz sine wave that starts at an arbitrary time is not bandlimited.) OK, so we slap a lowpass filter in front. Well, if we continue to be sticklers, a perfect linear-phase lowpass would have to start responding infinitely far ahead of the start, so we're screwed there too. But let's get realistic: a practical filter won't be perfect, but it will only be wrong for a short time, linear phase or otherwise.
And that's the bottom line—we get close enough, and we don't even try to catch the supposed first sample anyway.
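If you want to see the "a tone that starts is not bandlimited" point in numbers, here is a rough sketch in plain Python. Everything in it is my own choice for illustration: the 192 kHz grid standing in for continuous time, the 10 ms window, and the 30.1 kHz bin I look at. It compares a 1 kHz tone that runs through the whole window with one that switches on halfway through, at a zero crossing, which is the gentlest possible start. Since the simulation is itself sampled, it only approximates the continuous-time claim.

```python
import cmath
import math

FS_SIM = 192_000  # dense grid standing in for continuous time (assumed)
N = 1_920         # 10 ms window, so DFT bins are 100 Hz apart
F0 = 1_000        # tone frequency in Hz
START = N // 2    # the gated tone switches on halfway, at a zero crossing
K_HIGH = 301      # bin at 30.1 kHz, far above the tone

def tone(n):
    """1 kHz sine evaluated at sample index n of the dense grid."""
    return math.sin(2 * math.pi * F0 * n / FS_SIM)

def dft_bin(signal, k):
    """Magnitude of one DFT bin, computed directly from the definition."""
    acc = sum(x * cmath.exp(-2j * math.pi * k * n / N)
              for n, x in enumerate(signal))
    return abs(acc)

# Tone present for the whole window (a whole number of cycles).
steady = [tone(n) for n in range(N)]
# Same tone, but silent for the first half of the window.
gated = [tone(n) if n >= START else 0.0 for n in range(N)]

print("steady tone at 30.1 kHz:", dft_bin(steady, K_HIGH))
print("gated tone at 30.1 kHz: ", dft_bin(gated, K_HIGH))
```

The steady tone leaves essentially nothing at 30.1 kHz beyond floating-point rounding, while the gated one leaves a small but clearly nonzero amount. That leftover is the part a lowpass in front of the converter has to deal with.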
So, back to sampling: The most important thing to understand is that individual samples don't mean much. It's the train of samples that matters. Sampling modulates the original signal with a pulse train; that's what PCM (Pulse Code Modulation) is: pulse amplitude modulation, with the pulse amplitudes converted to "codes," i.e. digital values. That's the key to why the sample rate doesn't determine timing resolution, as long as the bandwidth is below half the sample rate (a restriction that itself limits how fast timing features can change).
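Here is a small sketch of that idea, again in plain Python and under my own assumptions (48 kHz sample rate, a 1 kHz test tone, a delay of a quarter of a sample, and a simple sine/cosine correlation as the estimator). A steady tone is not a "start," so this is only a toy, but it shows that a shift much smaller than one sample period is carried by the train of sample values, and that in this case bit depth shows up as a small error in the estimate.

```python
import math

FS = 48_000        # sample rate in Hz (assumed)
F0 = 1_000         # test tone in Hz, well below FS / 2
N = 4_800          # 100 whole cycles of the tone
TRUE_DELAY = 0.25  # delay in samples, roughly 5.2 microseconds

def quantize(x, bits):
    """Round x in [-1, 1] to the nearest level of a signed integer grid."""
    full_scale = 2 ** (bits - 1) - 1
    return round(x * full_scale) / full_scale

def estimate_delay(bits):
    """Sample a delayed sine, quantize it, and recover the delay in samples."""
    w = 2 * math.pi * F0 / FS  # radians per sample
    i_sum = 0.0
    q_sum = 0.0
    for n in range(N):
        x = quantize(math.sin(w * (n - TRUE_DELAY)), bits)
        i_sum += x * math.cos(w * n)
        q_sum += x * math.sin(w * n)
    # sin(w(n - d)) = sin(wn)cos(wd) - cos(wn)sin(wd), so the two
    # correlations are proportional to -sin(wd) and cos(wd).
    return math.atan2(-i_sum, q_sum) / w

for bits in (8, 16):
    print(bits, "bits -> estimated delay:", estimate_delay(bits), "samples")
```

Both estimates land close to the true 0.25 samples, with the 16-bit one closer, even though no sample sits at the moment the waveform was shifted to.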
Hope that helps, but let me know if you would like me to expand on any of that (especially the last part, I suspect).