But if we don't care, why is there an audible difference then? And what is "that level"? I know you guys want to see it on a piece of paper, and for that data or graph to accurately represent what would be heard in a real-world scenario when music is thrown at it. I don't think that's possible.
Heresy, you might scream: there's this test, that other test, etc. But pink/brown/white/whatever-color noise is always just that, and the same goes for a sine sweep. It's that thing, that sound. Songs aren't. They aren't static in amplitude, frequency content, or anything else; otherwise every song would sound the same. I think the test signals we throw at machines are pretty representative and helpful, but they are only a sampled approximation of what happens when we listen to all sorts of material.
I'll give you another real-world example of how saturation can be hard to detect: same stem, two versions with two different levels of saturation. Both showed no difference in the frequency response graph, had the same level across the spectrum, and were loudness matched; any EQ plugin with a match-EQ function did exactly nothing. Yet the more saturated one sounded denser and fuller. Now, if I put the saturator plugin in something like Plugin Doctor, of course I see the harmonics being created, and except for one spike slightly above, the rest sat below -100 dBFS on the graph. Neither I nor my colleagues have bionic ears, yet in the real-life scenario it was audible to all of us. It compounds.
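To put a rough number on "it compounds", here is a minimal Python sketch. Everything in it is an assumption for illustration: the tanh waveshaper `saturate` is a stand-in and not the actual plugin, and the drive, level, and partial frequencies are made up. It compares what the saturator adds to a single test tone with what it adds to a dense multi-partial signal at the same RMS level.

```python
import numpy as np

SR = 48_000                       # sample rate in Hz (assumed)
t = np.arange(SR) / SR            # one second, so FFT bin k sits at k Hz

def saturate(x, drive=0.1):
    """Hypothetical gentle saturator: tanh waveshaper, unity gain for small signals."""
    return np.tanh(drive * x) / drive

def normalize_rms(x, rms=0.1):
    """Scale x to a target RMS so both test signals are level-matched."""
    return x * (rms / np.sqrt(np.mean(x**2)))

def level_db(x, freq_hz):
    """Level of one FFT bin in dB relative to a full-scale sine."""
    mag = np.abs(np.fft.rfft(x)) / (len(x) / 2)
    return 20 * np.log10(max(mag[freq_hz], 1e-15))

def added_content_db(dry, wet):
    """Power of everything the saturator added, in dB relative to the dry signal."""
    gain = np.dot(wet, dry) / np.dot(dry, dry)   # best-fit plain level change
    residual = wet - gain * dry                  # what is left is new content
    return 10 * np.log10(np.sum(residual**2) / np.sum(dry**2))

# Test tone: a single 1 kHz sine, like an analyzer would use.
sine = normalize_rms(np.sin(2 * np.pi * 1000 * t))

# Stand-in for dense material: eight unrelated partials with random phases.
rng = np.random.default_rng(0)
freqs = [110, 233, 349, 523, 811, 1319, 2099, 3301]
dense = normalize_rms(
    sum(np.sin(2 * np.pi * f * t + rng.uniform(0, 2 * np.pi)) for f in freqs)
)

sine_h3 = level_db(saturate(sine), 3000)              # third harmonic of the tone
sine_added = added_content_db(sine, saturate(sine))
dense_added = added_content_db(dense, saturate(dense))

print(f"3rd harmonic of test tone:   {sine_h3:.1f} dBFS")
print(f"added content, single tone:  {sine_added:.1f} dB re signal")
print(f"added content, dense signal: {dense_added:.1f} dB re signal")
```

With a curve like this, the dense signal picks up intermodulation products at sums and differences of its partials on top of the plain harmonics, so the total added content should come out noticeably higher than the lone third harmonic an analyzer shows for a sine. That doesn't prove audibility, but it fits the idea that a single-tone plot understates what happens on real material.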