To address the question of whether we are comparing apples to apples or not, I see three main useful cases for AVR measurements:
1. Use the analog input through the cleanest path available in the AVR to the amp output. This is useful for comparing between AVRs, integrated amps, pure amps, etc. It is an evaluation of the build quality of the amp section of the AVR and is important. If the cleanest path possible varies from one unit to another, it does not matter. Let AVRs that provide the best performance in their best path win.
2. Use the digital input to the analog pre-out in the cleanest mode the AVR allows: no bass management, no surround modes, ideally just the DAC and no other use of the DSP. Again, if a manufacturer allows a cleaner path than others, good for them. This is useful for comparing DAC performance between AVRs and for benchmarking against pure DACs. It is an evaluation of the engineering of the digital path of the AVR, its noise isolation, etc.
3. Same as 2 above, but in the most common use case that engages the DSP. In other words, a set of features like bass management, room correction if available, etc. These are things that almost everybody (if not everybody) would be expected to use all the time (so no need to look at surround modes and all the other bells and whistles, which are there only if one needs them). The set may vary from one AVR to another, but that does not matter since it will be the most common mode for each AVR.
I am not saying the above is effortless. Case 2 alone, as currently performed, takes a lot of time. But I do think the above creates a more complete evaluation of AVRs against each other, answering three basic questions:
1. How good is the amp?
2. How good is the digital input handling in the pre/pro?
3. What penalties, if any, does one pay for using the most necessary/common features of the AVR in real use?
There are a lot more boundary/special cases than the above, but I believe these three will sufficiently capture both an evaluation of the engineering quality and what people can expect in practice.