What makes you think that? R2R ladder DACs are really simple (I designed and built my first one in 1975 or so), but it is a really primitive architecture that has long ago been superseded by better, more advanced designs.
The measurements of the 1421 are too bad to be believable, that's it.
And my reasoning is not about R2R vs DS, it is more general.
Regarding R2R being primitive: if you say so. DS is not necessarily better, it is cheaper. That's why it dominates, also in integrated designs. And DS designs have some problems, especially with complex transients. That's why they have been made more complex, such as multi-bit delta sigma, with various types of oversampling and processing. OTOH a good R2R design will, as you probably know, convert the upper bits with a single resistor, then move to a "real" R2R chain, and use various types of aliasing and averaging, together with self-measurement and correction. But R2R designs will (at least mathematically and with ideal resistors) represent transients more faithfully. OTOH with delta sigma you can more easily get "perfect" measurements.
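For anyone following along who has not looked inside a DS converter, here is a minimal sketch of the basic idea, assuming the simplest possible case: a first-order, 1-bit modulator running on an already oversampled input. The function name and the test value are my own choices for illustration, and real chips use higher-order, multi-bit loops with much more processing around them.

```python
def delta_sigma_1bit(samples):
    """First-order 1-bit delta-sigma modulator (textbook form, illustrative only).

    samples: input values in the range [-1.0, 1.0], assumed already oversampled.
    Returns a list of +1.0 / -1.0 values whose local average tracks the input.
    """
    integrator = 0.0
    feedback = 0.0  # previous output, fed back and subtracted from the input
    out = []
    for x in samples:
        integrator += x - feedback          # accumulate the quantisation error
        feedback = 1.0 if integrator >= 0.0 else -1.0  # 1-bit quantiser
        out.append(feedback)
    return out


# A constant input of 0.5 becomes a stream of +1/-1 whose mean is close to 0.5.
bits = delta_sigma_1bit([0.5] * 4000)
mean_level = sum(bits) / len(bits)
```

The point of the sketch is only that each individual output sample carries very little information, and the signal is recovered by averaging (filtering) over many of them. That is why oversampling and the loop design matter so much for how a DS converter handles fast changes.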
What "variable signal" measurements do you suggest, and what additional information would they provide?
Signals whose spectrum varies at more than one point in time. Even a very fast glissando would be interesting to observe, since there would be no periodic signal. They would provide information about the behaviour of the DAC with a quickly varying signal. Second example: two very, very close transients. Think of an orchestra where different percussion instruments play at the same time: two players never hit a note at exactly the same moment. Or an ensemble of strings all playing a pizzicato: there is never perfect synchrony. How does a DAC (or an amplifier) react to dozens of transients which are very, very close in time? Even a string player usually uses vibrato. A singer uses vibrato as well. The pitch is never constant. This is never tested. I want to see the results, and then we can say what they provide. Oftentimes "perfect" DACs fail at reproducing strings in a realistic way (many DS, BTW), and even though this is a subjective assessment, this *should* point at something. This should tell us that maybe something is missing.
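To make the kind of signals I mean concrete, here is a rough sketch of how such test signals could be generated. Everything in it is an assumption for illustration: the function names, the 48 kHz sample rate, the decaying sine burst as a stand-in for a plucked or struck note, and all the default values. It is not an established test suite.

```python
import math

FS = 48_000  # sample rate in Hz (assumed for illustration)


def glissando(f_start, f_end, duration, fs=FS):
    """Linear sine sweep from f_start to f_end (Hz) over duration (s)."""
    n = int(round(duration * fs))
    rate = (f_end - f_start) / duration  # sweep rate in Hz per second
    out = []
    for i in range(n):
        t = i / fs
        # Phase is the integral of the instantaneous frequency f_start + rate * t.
        out.append(math.sin(2 * math.pi * (f_start * t + 0.5 * rate * t * t)))
    return out


def vibrato(f_center, depth_hz, rate_hz, duration, fs=FS):
    """Sine whose pitch swings by +/- depth_hz around f_center, rate_hz times per second."""
    n = int(round(duration * fs))
    beta = depth_hz / rate_hz  # frequency-modulation index
    out = []
    for i in range(n):
        t = i / fs
        phase = 2 * math.pi * f_center * t + beta * (1 - math.cos(2 * math.pi * rate_hz * t))
        out.append(math.sin(phase))
    return out


def staggered_transients(onsets_s, duration, freq=2000.0, decay_s=0.002, fs=FS):
    """Sum of decaying sine bursts, one starting at each onset time (s)."""
    n = int(round(duration * fs))
    out = [0.0] * n
    for t0 in onsets_s:
        start = int(round(t0 * fs))
        for i in range(start, n):
            t = (i - start) / fs
            out[i] += math.exp(-t / decay_s) * math.sin(2 * math.pi * freq * t)
    return out


# Two "players" hitting 0.2 ms apart, a fast sweep, and a 440 Hz note with vibrato.
hits = staggered_transients([0.001, 0.0012], duration=0.01)
sweep = glissando(200.0, 8000.0, duration=0.1)
note = vibrato(440.0, depth_hz=5.0, rate_hz=6.0, duration=0.1)
```

The transient sum is unscaled, so it would need normalising before being sent to a converter. One would then play each signal through the DAC, capture the output, and compare it against the generated reference.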
TBH it would be interesting to see these studies and a correlation with the other measurements and subjective perception. Everything has been simplified to just THD, noise, maybe IMD and behaviour under a few other test signals. Maybe it is the mathematician in me that sees the incompleteness, but all papers I have seen claiming the "sufficiency" of some limited tests suffer from logical fallacies (they claim to prove it by stating "these are two frequencies which are common in reproduced music, and both within the hearing range of an adult, so let us add them together and measure"). I would be curious to read more.