I was initially going to ignore this, but realistically I believe I can offer some insight from practical results on a product I'm working on now. Specifically, I designed a system that regulates bias current in real time according to several parameters. Because bias is regulated, it can be held at a significantly higher level, in an optimum zone for least crossover distortion and, by inference, best listening quality.
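Just to illustrate the general idea (this is my own minimal sketch, not the author's actual circuit or algorithm): a regulated-bias scheme can be thought of as a servo loop that continuously nudges the bias-spreader voltage so the measured quiescent current tracks a target, regardless of temperature or rail drift. All names and values here (`target_ma`, `gain`, the voltage limits) are hypothetical.

```python
# Hypothetical sketch of closed-loop bias regulation (NOT the author's
# design): a simple integral controller that servos the measured bias
# current toward a fixed target each control cycle.

def make_bias_regulator(target_ma, gain=0.1, vbias_min=0.0, vbias_max=2.0):
    """Return a step function that updates the bias-spreader voltage."""
    vbias = 1.0  # hypothetical starting spreader voltage (V)

    def step(measured_ma):
        nonlocal vbias
        error = target_ma - measured_ma        # positive -> bias too low
        vbias += gain * error * 0.001          # small integral correction
        vbias = max(vbias_min, min(vbias, vbias_max))  # clamp to safe range
        return vbias

    return step

# Toy usage: measured bias sags (e.g., thermal drift), and the loop
# walks the spreader voltage back up; on target, it holds steady.
reg = make_bias_regulator(target_ma=150.0)
v1 = reg(120.0)  # under target -> voltage rises
v2 = reg(150.0)  # on target  -> voltage holds
```

The point of regulating rather than fixing the bias is exactly what the post describes: the loop holds the output stage in its optimum zone under all operating conditions, instead of having to set a conservative static value.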
I conducted listening tests where I didn't tell participants what was being changed or altered; I just asked for their impressions. Although I knew what was going on, the way it was set up there was no way for participants to know what, if anything, was changing (and in many instances I changed nothing) or what they were listening for. In all instances I was far from the amplifier and out of sight. The participants had no in-depth knowledge of the R&D I was doing, my specific goals, or my methods. The bias of the amplifier in question was varied among several levels, and the preference was 100% uniform for one particular setting. I had done no AP measurements beforehand; I was pulling bias levels essentially out of the air. After these listening sessions, I had AP measurements taken of distortion at various power levels from 20 kHz down to 20 Hz. The results were unambiguous: the preferred bias level sat exactly in the "sweet" zone of minimum distortion across the audio band.
However, this bias level was also far higher (roughly 200% higher) than could practically be set in an eight-channel home theater amplifier using conventional methods of setting bias, i.e., bias not strictly regulated under all operating conditions. This was overcome so optimum bias could be set, with no downsides.
Also, to your point, this was obvious in the subsequent AP measurements. I purposely did the subjective tests first to eliminate, as much as practical, any confirmation bias on my part beyond my stated design goal for the system.
So yes, the audible differences proved to be real, and yes, they were measurable. In this case, however, I did things in reverse, both to test my design and to eliminate as much as possible any validation bias that what I was doing was correct - or not.
To address your post more specifically: yes, I have heard differences in other amplifiers. It has been my general experience that although amplifiers might measure "the same" in static distortion, spectral analysis of the worse-sounding amplifiers reveals higher levels of upper-harmonic distortion, whereas the better-sounding ones have mostly lower-order (and specifically even-order) distortion products. So this seems to validate both measuring and listening, which is what I do.
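For readers who want to see what "spectral analysis of distortion products" means in practice, here is a small illustrative sketch (my example, not the author's AP procedure): given a captured output waveform, project it onto each harmonic of the test fundamental to see how the distortion energy is distributed between low-order and high-order products. The signal, sample rate, and harmonic levels below are all made up for the demonstration.

```python
import math

# Illustrative harmonic-spectrum sketch (NOT the AP analyzer's method):
# correlate the captured waveform against each harmonic of the
# fundamental to recover the amplitude of each distortion product.

def harmonic_amplitudes(samples, fs, f0, n_harmonics=9):
    """Return the amplitude of harmonics 1..n_harmonics of f0."""
    n = len(samples)
    amps = []
    for k in range(1, n_harmonics + 1):
        w = 2 * math.pi * k * f0 / fs
        re = sum(s * math.cos(w * i) for i, s in enumerate(samples))
        im = sum(s * math.sin(w * i) for i, s in enumerate(samples))
        amps.append(2 * math.hypot(re, im) / n)  # amplitude of k*f0
    return amps

# Synthetic test signal: 1 kHz fundamental plus 1% 2nd harmonic and
# 0.1% 7th harmonic, standing in for "benign" low-order vs "harsh"
# high-order distortion products.
fs, f0, n = 48000, 1000, 48000
sig = [math.sin(2 * math.pi * f0 * i / fs)
       + 0.01 * math.sin(2 * math.pi * 2 * f0 * i / fs)
       + 0.001 * math.sin(2 * math.pi * 7 * f0 * i / fs)
       for i in range(n)]
amps = harmonic_amplitudes(sig, fs, f0)
```

Two amplifiers with the same total THD number can have very different profiles in `amps`: one dominated by the 2nd and 3rd entries, the other with significant energy out at the 7th and beyond, which is the distinction the post is drawing.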
What isn't revealed in any measurements are (as I have posted repeatedly before) non-fidelity issues, such as whether the amplifier makes pops or other noises at turn-on or turn-off. Also, since the system I'm working on now uses an innovative way of accomplishing bias control, does it produce any transient noises or other artifacts? Trying to catch such events by sitting at a 'scope, hoping to "see" a transient, would be very inefficient, and it would tell me nothing about what any such event actually sounded like: was it an oscillation, a pop, a hiss, a sliding tone, or some other artifact? If any of these had been heard (they weren't), their character would have given a clue about what to look for in subsequent troubleshooting.
Really, listening for abnormal behavior is a routine procedure in the development process, even for more conventional amplifiers. It is not generally talked about because it isn't "sexy" or controversial, but it is a very large part of listening to new designs. Although I don't have any specific experience with this, I suspect that a lot of listening was carried out in the development of class D amplifier modules.