Transferring digital audio is like transferring any other kind of digital data: it can be push or pull. By "push" I mean the source sends the data to the destination, which receives and interprets it. By "pull" I mean the destination requests data from the source, so the source sends chunks of data "on demand", whenever the destination asks for them.
With digital audio, the data consists of amplitude samples, so the rate at which they are sent is important. That's what the clock is for: to ensure the samples are interpreted at the correct rate. For example, CD audio runs at 44,100 samples per second, which works out to 1/44100 = 2.268e-5 seconds between samples.
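To make the arithmetic concrete, here is a tiny Python sketch of the spacing calculation. The function name is just my own label for illustration, not part of any audio API.

```python
SAMPLE_RATE_HZ = 44_100  # CD audio sample rate


def sample_period_seconds(rate_hz):
    """Return the time between consecutive samples, in seconds."""
    return 1.0 / rate_hz


period = sample_period_seconds(SAMPLE_RATE_HZ)
print(f"{period:.4e} s between samples")  # roughly 2.268e-05 s
```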
In the "push" scenario, the source uses its own clock to ensure it sends samples at this rate. The destination can buffer the samples and use its own clock to space them evenly. But if the source and destination clocks are not perfectly in sync, so that one runs slightly faster or slower than the other, the buffer will eventually overflow or underflow and you'll get a glitch. The best the destination can do is buffer the samples, use its own clock to measure the average rate at which they are arriving, and reclock / respace them using that average. By its own clock, that average might not match the nominal rate (e.g. a 2.268e-5 second gap between samples at 44,100 samples per second), but it must follow whatever rate the source is providing because it doesn't control the rate. At least this buffering and re-clocking will correct any sample timing differences introduced during transmission.
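To get a feel for how quickly mismatched clocks cause trouble in the push case, here is a back-of-the-envelope sketch in Python. It assumes the destination plays at its own fixed rate without adapting, and that the clock mismatch is constant. The function name, the 100 ppm mismatch, and the 0.1 second of buffer slack are made-up illustration values, not figures from any real device.

```python
def seconds_until_glitch(rate_hz, ppm_offset, headroom_samples):
    """Estimate how long a buffer lasts before it over- or underflows.

    rate_hz          -- nominal sample rate
    ppm_offset       -- clock mismatch between source and destination,
                        in parts per million (sign does not matter here)
    headroom_samples -- free space (or stored samples) available to
                        absorb the mismatch
    """
    # Samples gained or lost per second because the clocks disagree.
    drift_samples_per_second = rate_hz * abs(ppm_offset) / 1_000_000
    if drift_samples_per_second == 0:
        return float("inf")  # perfectly matched clocks never glitch
    return headroom_samples / drift_samples_per_second


# Assumed example: 44,100 Hz, clocks differ by 100 ppm,
# and the buffer has 0.1 s (4,410 samples) of slack.
t = seconds_until_glitch(44_100, 100, 4_410)
print(f"{t:.0f} s until the buffer runs out of slack")
```

With these assumed numbers the slack is used up after about 1000 seconds, which is why a destination that can't control the rate has to track the source's average rate instead of trusting its own clock.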
In the "pull" scenario, the destination uses its own clock to determine the rate or spacing of samples. It requests chunks or batches of samples from the source, according to its own (the destination's) clock. For example, with CD audio it could ask for 44,100 samples every second, or 22,050 samples every half second, or whatever. Of course, it will ask for samples ahead of when it needs them, store them in a buffer until needed, and read from its buffer at the exact rate determined by its own clock. In this case the source doesn't even need a clock; it simply transmits chunks or batches of samples on demand.
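The pull arrangement can be sketched as a toy Python simulation. The class names, the chunk size, and the "low water" refill threshold are all my own inventions for illustration; real interfaces differ in the details. Each call to tick() stands in for one period of the destination's clock.

```python
from collections import deque
from itertools import islice


class Source:
    """Clockless source: hands over samples only when asked."""

    def __init__(self, samples):
        self._samples = iter(samples)

    def read(self, count):
        # Return up to `count` samples; fewer if the data runs out.
        return list(islice(self._samples, count))


class Destination:
    """Owns the clock; pulls chunks ahead of time into a buffer."""

    def __init__(self, source, chunk_size, low_water):
        self.source = source
        self.chunk_size = chunk_size
        self.low_water = low_water
        self.buffer = deque()
        self.requests = 0

    def tick(self):
        """Called once per sample period by the destination's own clock."""
        # Refill before the buffer runs dry, not when it is already empty.
        if len(self.buffer) <= self.low_water:
            self.buffer.extend(self.source.read(self.chunk_size))
            self.requests += 1
        return self.buffer.popleft() if self.buffer else None


src = Source(range(10))  # ten dummy "samples"
dst = Destination(src, chunk_size=4, low_water=1)
played = [dst.tick() for _ in range(10)]
print(played)
```

Notice that the Source class has no notion of time at all. The only timing in the system is how often the destination's clock calls tick().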
This is an oversimplification but I hope it conveys the general idea...