Many of the criticisms of how I measure amps have already been addressed. Let me summarize:
1. My disagreement with OP was over picking the highest (worst) clipping point for one amp and comparing it to my measurement, where I pick the bottom of the hockey stick. You can argue that my measurement gives variable power ratings for amps of different designs. But you can't argue that picking the highest, worst clipping point for an amp is valid. That point too can vary depending on amp design, and it need not correspond to 1% THD as it happens to do in this measurement. Had OP used a higher input level, the distortion could well have gone above 1% THD.
Picking the knee of the curve gives the *design parameter* for the amp as far as power is concerned. It also generates a conservative wattage rating, but with some variability due to the limited resolution of the graph at that point, especially if the rise past the knee is very vertical. This is addressed in #2.
2. To deal with the variability above, a good while ago I added a new measurement that uses the same THD+N level for the power measurements. It does this for both "continuous" and "peak" ratings:
3. I do run power vs. frequency tests. Here is one:
This is superior to picking one power level and running a sweep vs. frequency. You see the full data for a complete power sweep, albeit at discrete frequencies. Notice how picking the "knee" helps here to show the limited power at 20 Hz.
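
For anyone who wants to compare the two ways of reading a power sweep on their own data, here is a minimal Python sketch. It is not the analyzer's actual algorithm. The function names, the 1% target, and the sample numbers are all made up for illustration. The "knee" is taken as the point of lowest THD+N (the bottom of the hockey stick), and the fixed-threshold rating is the first point past the knee where THD+N reaches the target, interpolated on log-log axes.

```python
import math

def knee_power(powers, thdn):
    """Power at the lowest THD+N point (bottom of the hockey stick)."""
    i = min(range(len(thdn)), key=thdn.__getitem__)
    return powers[i]

def power_at_thdn(powers, thdn, target):
    """Power where THD+N first reaches `target` past the knee.

    Interpolates between sweep points on log-log axes.
    Returns None if the sweep never reaches the target.
    """
    start = min(range(len(thdn)), key=thdn.__getitem__)
    for i in range(start, len(thdn) - 1):
        lo, hi = thdn[i], thdn[i + 1]
        if lo <= target <= hi:
            if hi == lo:
                return powers[i]
            frac = (math.log10(target) - math.log10(lo)) / (
                math.log10(hi) - math.log10(lo)
            )
            log_p = math.log10(powers[i]) + frac * (
                math.log10(powers[i + 1]) - math.log10(powers[i])
            )
            return 10 ** log_p
    return None

# Hypothetical sweep: output power in watts, THD+N in percent.
powers = [1, 2, 5, 10, 20, 50, 60, 65, 70]
thdn = [0.02, 0.012, 0.006, 0.004, 0.003, 0.002, 0.05, 1.0, 5.0]

knee = knee_power(powers, thdn)               # 50 W for this made-up data
rated = power_at_thdn(powers, thdn, 1.0)      # assumed 1% THD+N target
print(f"knee: {knee} W, at 1% THD+N: {rated:.1f} W")
```

With a steep rise like this one, the knee reading depends heavily on how finely the sweep is sampled around the minimum, while the fixed-threshold reading lands on the near-vertical part of the curve and moves very little. That is the variability the fixed THD+N measurement is meant to remove.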