This is an older thread but some points come to mind as I read it:
1. If we can't detect a difference sighted (when level-matched), we don't need a controlled test. ABX tests the hypothesis that a difference we hear is not the result of sighted bias; if we don't think we hear a difference, there is nothing to test.
2. The notion of what might make amps sound different interests me, and this thread provides a good context for exploring it. To my thinking, actual differences between amps might be caused by:
a. Inability to maintain linearity in the presence of a given load. It seems to me that if an amp stays linear into a 2-ohm dummy load, it should handle a dip to 2 ohms in a speaker, assuming the load isn't heavily capacitive. So when I see amp tests that plot distortion vs. power into 2 ohms, I take that as showing the amp's capability into the usual difficult loads. But if I felt the need to own speakers that were truly difficult (sub-2-ohm impedance somewhere in the spectrum, or a large phase angle coupled with an impedance dip), I'd want to see testing against a similar load. Stereophile has its simulated speaker load, which is not particularly difficult but is probably relevant to most people with reasonable speakers. DonH56 has written cogently about this topic on ASR. It takes a particularly beastly load to create an audible effect (and even then the effect is only relevant for that load). The first sketch after this list puts rough numbers to the current demand.
b. Clipping. Nearly all tests are conducted at steady state, or at listening levels specifically chosen to keep the amp within its power envelope; power output is usually tested separately. The question is how often a real signal includes peaks that clip an amp without us realizing it. The only way to know for sure is to record the output of the amp and study the waveform. Or we can rely on the clipping indicators on amps that have them, though manufacturers rarely specify how sensitive those indicators are to differences between input and output waveforms. I recall an analysis by one magazine that found its test protocol drove amps under test into clipping, at distortion levels high enough to potentially be audible. Frequency-response behavior at clipping with broadband input (like, say, pink noise) is rarely tested. I'd be happy testing with music heavy in low percussion, recorded very loudly, using a scope output or a recorded waveform to determine the clipping point for a consistently used test track. The second sketch after this list shows a simple flat-top detector for such a recording.
c. Distortion, which I define broadly as any difference between output and input across the frequency spectrum (meaning that clipping effects are one example of distortion). Distortion at low frequencies is, to my ears (and I have tested this), more critical than distortion at high frequencies, particularly harmonic distortion, simply because the harmonics of a low fundamental land in a range where the ear is more sensitive, and so rise into audibility more quickly relative to the fundamental. This is particularly true when the low-frequency signal under test has no harmonics of its own--a clean sine wave--which is thankfully rare in music. To my ears, distortion is really hard to hear when playing actual music, particularly harmonic distortion: nearly all musical sources present a range of harmonics as part of their characteristic sound, so harmonics added by an amp would have to be loud enough to overcome the masking effect of the source material. The third sketch after this list shows one way to read those harmonic levels.
d. Frequency-response linearity. Most amps not subject to the above issues are exceptionally good here, unless they are trying not to be. I have amps that roll off the top octave by some fraction of a dB by the time they reach 20 kHz. I think even listeners with good high-frequency hearing would be hard-pressed to demonstrate in a blind test that they can hear such a droop. The fourth sketch after this list does the arithmetic on what that droop implies.
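For (a), the current demand is easy to put rough numbers on. A toy calculation in Python; the 100 W / 8 ohm rating is a made-up example amp, not any particular product:

```python
# Rough demand on an amp when a speaker's impedance dips -- a toy
# calculation, not a measurement. The 100 W / 8 ohm rating is hypothetical.
rated_power = 100.0                            # W into 8 ohms (example)
v_peak = (2 * rated_power * 8.0) ** 0.5        # peak volts at rated power: 40 V

for z in (8.0, 4.0, 2.0):
    i_peak = v_peak / z                        # peak current at that swing
    p_avg = v_peak ** 2 / (2 * z)              # average sine power at that swing
    print(f"{z:3.0f} ohms: {i_peak:4.1f} A peak, {p_avg:4.0f} W")
# Same voltage swing: 5 A / 100 W into 8 ohms becomes 20 A / 400 W into
# the 2-ohm dip. Linearity into a 2-ohm dummy load is what certifies the
# amp can source that current cleanly.
```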
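For (b), a minimal sketch of the record-and-study approach. "amp_output.wav" is a placeholder for your own capture of the amp's output, and the 99% threshold and 4-sample run length are guesses you'd tune against a known-clean recording. It just flags runs of samples pinned near the recording's peak, which is what a flat-topped clip looks like:

```python
# A minimal flat-top detector for a recorded amp output -- a sketch, not a
# lab instrument.
import numpy as np
import soundfile as sf

x, fs = sf.read("amp_output.wav")     # placeholder filename
if x.ndim > 1:
    x = x[:, 0]                       # analyze one channel

near_rail = np.abs(x) >= 0.99 * np.abs(x).max()

# Find runs of consecutive near-rail samples; several samples pinned at
# the peak look like a flat top, not a normal waveform crest.
edges = np.diff(near_rail.astype(int))
starts = np.flatnonzero(edges == 1) + 1
ends = np.flatnonzero(edges == -1) + 1
if near_rail[0]:
    starts = np.r_[0, starts]
if near_rail[-1]:
    ends = np.r_[ends, len(x)]

MIN_RUN = 4                           # pinned samples before we cry "clip"
run_lengths = ends - starts
clipped = run_lengths[run_lengths >= MIN_RUN]
print(f"{len(clipped)} suspected flat-topped peaks")
if len(clipped):
    print(f"longest: {clipped.max()} samples ({1000 * clipped.max() / fs:.2f} ms)")
```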
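For (c), a sketch of how I'd read harmonic levels off a captured sine test tone. "sine_capture.wav" is a placeholder for the amp's output while it reproduces a clean low-frequency sine (say, 50 Hz); it takes the biggest FFT peak as the fundamental and reports the next few harmonics relative to it:

```python
# Harmonic levels read off a captured sine test tone -- a sketch. A real
# measurement would search a few bins around each harmonic.
import numpy as np
import soundfile as sf

x, fs = sf.read("sine_capture.wav")    # placeholder filename
if x.ndim > 1:
    x = x[:, 0]

spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
spec[0] = 0.0                          # ignore any DC offset
freqs = np.fft.rfftfreq(len(x), 1 / fs)

f0_bin = int(np.argmax(spec))          # fundamental = biggest peak
f0 = freqs[f0_bin]
print(f"fundamental: {f0:.1f} Hz")
for n in range(2, 6):                  # 2nd through 5th harmonics
    b = int(np.argmin(np.abs(freqs - n * f0)))
    level_db = 20 * np.log10(spec[b] / spec[f0_bin])
    print(f"H{n} ({n * f0:.0f} Hz): {level_db:6.1f} dB re fundamental")
```

The point the numbers make: a third harmonic of 50 Hz lands at 150 Hz, where the ear is far more sensitive than at 50 Hz, so it climbs into audibility well before its level relative to the fundamental suggests it should.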
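For (d), here's what "a fraction of a dB at 20 kHz" implies if the rolloff is a simple single-pole shape (an assumption; real amps vary):

```python
# Corner frequency implied by a small droop at 20 kHz, assuming a
# first-order (single-pole) rolloff. Pure arithmetic, no measurement.
import math

droop_db = 0.5                               # example droop at 20 kHz
ratio = 10 ** (droop_db / 20)                # linear attenuation factor
fc = 20_000 / math.sqrt(ratio ** 2 - 1)      # solve |H| = 1/sqrt(1+(f/fc)^2)
print(f"{droop_db} dB down at 20 kHz -> -3 dB corner near {fc / 1000:.0f} kHz")
# 0.5 dB at 20 kHz implies a corner around 57 kHz -- well past hearing,
# which is why such a droop is so hard to demonstrate blind.
```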
Of these, it seems to me that only clipping is an effect normal people are likely to face in their home setups, unless they buy big amps and always listen at low levels. I know I have seen the clipping indicators flash on my NC502MP (350 watts into my 6-ohm-nominal Revel speakers) when playing highly dynamic and percussive source material. I would think a cleanly recorded mechanical metronome would be perfect for this--it can be played very loudly without summoning the constabulary. I have one such, and maybe a good enough microphone to give that a try.
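As a quick preliminary to that experiment, the crest factor of the source material tells you how far the peaks sit above the average level. A sketch, with "metronome.wav" as a placeholder for whatever track you analyze:

```python
# Crest factor (peak vs. RMS) of a test track -- a sketch of why
# percussive material flashes clip lights at modest average levels.
import numpy as np
import soundfile as sf

x, fs = sf.read("metronome.wav")     # placeholder filename
if x.ndim > 1:
    x = x.mean(axis=1)               # fold to mono

rms = np.sqrt(np.mean(x ** 2))
peak = np.abs(x).max()
print(f"crest factor: {20 * np.log10(peak / rms):.1f} dB")
# A 15 dB crest factor means peaks demand ~32x the average power: an amp
# loafing along at 10 W average needs ~320 W on the ticks to stay clean.
```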
But people who routinely listen at low levels are the sorts who think, "I don't need a big amp--I only listen at low levels." Then, when the occasion to crank it up arises--hello, clipping! Those occasions seem to me to include "putting an amp through its paces" in the context of a home "test". They then report differences here as characteristic of the amps, rather than as reflecting their excursion (so to speak) outside the amp's operating envelope.
I offer this as a summary of what I read in this thread, where subjective impressions were dismissed as phantoms when they might have had more concrete causes. One can entertain those causes without insisting that amps routinely sound different within their linear range, driving the sorts of speakers most people use. The point is not to dismiss subjective impressions, but to explain them where possible, and thereby validate them (or not).
Rick "not quite sure how I ended up reading this thread weeks after the fact" Denney