Very briefly, as there is a lot going on.
For jitter the entire problem is to create a stable local sample clock in the face of incoming data that both defines the sample rate and has timing variations. Furthermore, we are only concerned with jitter that has energy in a range of frequencies that can affect the audio.
Ideally the DAC would provide its own sample clock and the incoming data stream would be slaved to that. In general this is just not viable, but it is possible in specialised cases, such as a USB DAC using the asynchronous transfer protocol. The moment you get anything more complex than a single device, though, you need to synchronise the local sample clocks.
A phase locked loop is simply a way of managing a local clock so that it tracks, on average, an external clock. By comparing the phase of the external and local clocks one can derive an error correction signal and use that to nudge the local clock to keep it in sync. But that nudging itself adds jitter to the local clock. The question becomes how to apply the correction so that the resulting jitter in the local clock stays outside the frequencies that matter for audio. This turns out to be harder than one might hope. The obvious thing is to make the changes to the local clock gently, taking an average of the error signal and applying that. This is another way of saying that a low pass filter is applied to the error signal. The trouble is that eventually the error is applied so slowly that the PLL is unable to find synchronisation with the input in the first place, or drops out of sync if the input varies too much. The overlap between audible jitter energy, a filter slow enough to remove that energy, and yet fast enough to allow the system to keep sync is very delicate.
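The loop just described can be sketched in a few lines. This is a toy illustration with made-up gains, using a simple exponential average as the low pass filter on the error signal; real PLLs are analogue or far more carefully tuned:

```python
import math

# Toy software PLL: a local numerically controlled oscillator (NCO)
# tracks an external clock's phase. All gains are illustrative only.

def run_pll(ext_phases, kp=0.1, ki=0.005, smoothing=0.2):
    """Track a list of external clock phases (radians).

    kp, ki     -- proportional / integral loop gains
    smoothing  -- low pass (exponential average) on the phase error;
                  gentler smoothing rejects more jitter but risks losing lock
    """
    local_phase = 0.0
    local_freq = 0.1        # initial guess at the phase step per tick
    avg_error = 0.0
    errors = []
    for ext_phase in ext_phases:
        # Phase detector: difference wrapped into (-pi, pi]
        error = math.atan2(math.sin(ext_phase - local_phase),
                           math.cos(ext_phase - local_phase))
        # Low pass filter the error so fast jitter never reaches the clock
        avg_error = (1 - smoothing) * avg_error + smoothing * error
        # Nudge the local frequency (integral) and phase (proportional)
        local_freq += ki * avg_error
        local_phase += local_freq + kp * avg_error
        errors.append(error)
    return errors

# External clock: 0.12 rad/tick plus a small, fast jitter component
ext = [0.12 * n + 0.01 * math.sin(1.0 * n) for n in range(2000)]
errors = run_pll(ext)
print(abs(errors[-1]) < 0.05)   # locked: residual error is just the jitter
```

Shrinking `smoothing` rejects more of the fast jitter, but pushed too far the loop can no longer acquire or hold lock: exactly the delicate trade-off described above.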
Lots of tricks and adaptive controls are possible here. Only using occasional samples from the incoming clock is another way of trying to remove unwanted energy from the jitter spectrum of the incoming clock. The ESS mechanism is another way of finding a benign estimate of the incoming clock and creating a local error estimate, again one that can be applied to the local clock in a smooth manner that does not put energy into the audio in bad ways.
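As a rough illustration of the occasional-sampling idea (a generic sketch, not the actual ESS algorithm), one can estimate the incoming clock's period from every Nth edge and smooth the estimate before it is used to steer anything:

```python
import random

# Estimate an incoming clock's period from occasional edge samples,
# exponentially smoothing the result so short-term jitter is ignored.
# Generic illustration only; the real ESS mechanism is not public.

def estimate_period(edge_times, every_nth=32, smoothing=0.1):
    """Mean period between every Nth clock edge, exponentially smoothed."""
    estimate = None
    picks = edge_times[::every_nth]
    for prev, curr in zip(picks, picks[1:]):
        measured = (curr - prev) / every_nth   # mean period over the gap
        if estimate is None:
            estimate = measured
        else:
            # Low pass the estimate: each new measurement only nudges it
            estimate = (1 - smoothing) * estimate + smoothing * measured
        # (the smoothed estimate would steer the local clock here)
    return estimate

# Simulate a nominal 10 us clock with +/-1 us of jitter on each edge
random.seed(0)
t, edges = 0.0, []
for _ in range(10_000):
    t += 10e-6
    edges.append(t + random.uniform(-1e-6, 1e-6))

print(round(estimate_period(edges) * 1e6, 2))  # close to 10.00 us
```

Spacing the measurements out and averaging over many edges is what keeps the fast jitter out of the estimate.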
The other way of managing jitter is to resample the incoming data and feed that to a DAC that is running with a local sample clock. But what does this mean? In the abstract, you want a way of arbitrarily resampling the data, taking any possible incoming sample rate and converting it to a different sample rate. That is near black magic. In the extreme, you feed the data to a DAC, feed that to an ADC, and you can resample anything. But that misses the point. If two sample rates are a simple ratio of one another, sample rate conversion is easy. For instance, converting 48kHz to 32kHz was done to provide digital feeds to FM radio stations (which are bandwidth limited to 15kHz anyway). This was one of the reasons 48kHz was adopted as a pro standard. Arbitrary sample rate conversion came a lot later; conversion from 44.1kHz to 48kHz was for a while simply not possible. Doing so where the two sample rates could vary relative to one another (the asynchronous part) came later still.
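The simple-ratio case can be sketched directly. Converting 48kHz to 32kHz is a 2:3 job: insert zeros to go up by 2 (to 96kHz), low pass filter, then keep every third sample. The short windowed-sinc filter here is purely illustrative; broadcast-grade converters used much better filters:

```python
import math

def lowpass_fir(cutoff, ntaps=63):
    """Hamming-windowed-sinc low pass FIR; cutoff as a fraction of Nyquist."""
    mid = ntaps // 2
    taps = []
    for n in range(ntaps):
        x = n - mid
        h = cutoff if x == 0 else math.sin(math.pi * cutoff * x) / (math.pi * x)
        h *= 0.54 - 0.46 * math.cos(2 * math.pi * n / (ntaps - 1))
        taps.append(h)
    s = sum(taps)
    return [t / s for t in taps]           # normalise to unity DC gain

def resample_48k_to_32k(samples):
    # 1. Upsample by 2: interleave zeros (this creates spectral images)
    up = []
    for s in samples:
        up.extend([s, 0.0])
    # 2. Low pass at 16kHz (1/3 of the 48kHz Nyquist at the 96kHz rate)
    #    to remove the images and prevent aliasing on the way back down
    taps = lowpass_fir(cutoff=1 / 3)
    filtered = []
    for i in range(len(up)):
        acc = 0.0
        for k, t in enumerate(taps):
            if 0 <= i - k < len(up):
                acc += t * up[i - k]
        filtered.append(2.0 * acc)          # gain of 2 restores amplitude
    # 3. Downsample by 3: keep every third sample
    return filtered[::3]

# A 1kHz tone at 48kHz comes out as the same tone at 32kHz
tone = [math.sin(2 * math.pi * 1000 * n / 48000) for n in range(480)]
out = resample_48k_to_32k(tone)
print(len(out))   # 320: two thirds of the input length
```

The whole thing is fixed arithmetic at a fixed ratio, which is exactly why it was feasible long before arbitrary-ratio conversion was.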
Ideally, with an ASRC there is an incoming stream with its wobbly data clock, and a different local clock. Bingo: you can have your local clock and don't need to slave to the external clock. But that ignores a lot of grief inside the ASRC. ASRCs don't work well if the two sample clocks are close to one another in frequency, which is why you often see internal clocks at non-standard sample rates, with samples converted both up and down in frequency to meet the one internal rate. ASRCs are not magic, and still need to be fed reasonably stable clocks.
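The core of an ASRC can be caricatured as reading the input through a fractional phase accumulator stepped by the ratio of the two clock rates. Linear interpolation keeps this sketch short; real ASRCs use long polyphase filters and a carefully smoothed, continuously updated ratio estimate:

```python
# Minimal caricature of an ASRC's inner loop. Illustrative only:
# real parts interpolate with long filters, not two-point lines.

def asrc(samples, ratio):
    """Resample by an arbitrary ratio via linear interpolation.

    ratio -- input samples consumed per output sample; in a real ASRC
             this value drifts as the two clocks wander relative to
             each other, and is itself a smoothed estimate.
    """
    out = []
    phase = 0.0                     # fractional read position in the input
    while phase < len(samples) - 1:
        i = int(phase)
        frac = phase - i
        # Interpolate between the two input samples straddling 'phase'
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        phase += ratio              # advance by the clock-rate ratio
    return out

# A linear ramp makes the result easy to check by eye
ramp = [float(n) for n in range(100)]
out = asrc(ramp, ratio=44100 / 48000)
print(len(out), round(out[48], 4))  # out[48] sits at input position 48*ratio = 44.1
```

Note what happens if `ratio` is almost exactly 1: the output spends long stretches reading nearly on top of an input sample, then slowly beats through the fractions, which is one intuition for why ASRCs behave badly with two near-identical clocks.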
In general an ASRC is not the same as the oversampling mechanism in the DAC. The nature of things is such that you could conceivably optimise the operation of a system so that there was some commonality, but oversampling has a very different reason for existing and is independent of an ASRC. You could trivially use an ASRC with a non-oversampling DAC. Delta sigma DACs oversample intrinsically as part of their operation, in a manner quite independent of any external sample rate conversion.
So can I assume that a first jitter correction is done at the USB input by the XMOS receiver resampling the incoming signal (because USB chips are known to be asynchronous these days) to clock it with the DAC master oscillators? And then for SPDIF input, does ESS in its turn do a high-speed resampling of the signal at the chip to get a well-controlled metric?
This is confusing uses of the word "asynchronous" in lots of different contexts. Asynchronous simply means without synchronisation. Similarly, "sampling" is being confused. All references to resampling in this context refer to changing the sample rate of the audio. Sampling, however, can mean sampling anything, and here it usually means sampling the clocks: nothing to do with the audio. Asynchronous USB simply means that the USB receiver is not synchronised to the USB source. It says nothing about the payload. A USB receiver will not resample the audio.
The ESS jitter control does not resample either. It samples the clock to get a metric of the clock rate, but that refers to the data clock, not the audio samples. There may be an ASRC present, but that is quite separate. Oversampling is separate again.