I don't see a problem with the test, either. However, it tests another piece in the chain, one you usually never change. It's a fine demonstration of the quality of DA-AD conversion by itself, and that's about it, the way I see it.
As for the test per se, it involves recognition and preference as well. So let's suppose you can hear a difference between each sample. Hearing a difference isn't enough; you need to recognize the last-gen copy. Recognition rests on the presumption that you'll prefer the earlier-gen copies for their fidelity. What if your personal preference is different? With mp3 I sometimes prefer the codec-compressed version to the original, if the original sounds so-so. So you see, even hearing a difference won't tell you which copy is last gen. For that, each copy would also have to sound 'worse' to your ear than the one before it.
If you don't mind a colourful description, this test reminds me of a hypothetical situation where food is eaten, digested and expelled through the natural outlet, then eaten again. Repeat the cycle 8 times in a row,
and the task is to recognize the final sh*t.