The reason for frequency-domain comparison is that our hearing is tuned to frequency discrimination as one of the primary mechanisms for identifying and differentiating sounds. Ears don't do sample-by-sample comparison in the time domain, which is what RMS-error or DF metrics do.
Frequency-domain discrimination alone doesn't account for forward masking, or for the much less effective backward masking. For example, take two different dynamic signals (e.g. signal bursts) that produce the same 400 ms average. What we hear will be very different if all the energy is lumped into the first 100 ms of the 400 ms window, versus two bursts: a lower-level incident at the front of the window, then a gap (e.g. 200 ms), followed by an echo or reflection. I think one unique and very powerful benefit of your metric resides in its ability to detect and uncover time-variant processes, intended or otherwise. Summing them into a 400 ms average will significantly obscure that resolution.
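To make that burst example concrete, here's a minimal sketch (the 48 kHz sample rate and the exact levels are my choices, just for illustration): two signals with very different time structure but identical 400 ms mean-square averages.

```python
import math

FS = 48_000               # sample rate (Hz) -- assumed for illustration
WIN = int(0.4 * FS)       # 400 ms integration window

def mean_square(x):
    """Average power over the whole window."""
    return sum(s * s for s in x) / len(x)

burst = int(0.1 * FS)     # 100 ms burst length

# Signal A: all energy lumped into the first 100 ms, silence after.
a = [1.0] * burst + [0.0] * (WIN - burst)

# Signal B: a quieter 100 ms burst, a 200 ms gap, then a 100 ms "echo"
# at the same level -- twice the events at half the instantaneous power.
lvl = math.sqrt(0.5)
b = [lvl] * burst + [0.0] * (2 * burst) + [lvl] * burst

print(mean_square(a), mean_square(b))  # identical 400 ms averages
```

Perceptually these are nothing alike (the gap and echo are clearly audible), yet any metric that integrates over the full 400 ms sees the same number.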
That's actually where the 400 ms value comes from. Standards-based loudness measurements, such as LUFS, are based on 400 ms windows. So was the DF metric, which is why I picked it. If there's a better window size for loudness integration, I'd love to see some research on it.
Hmm, let me dig through my old notes. This was 25 years ago, but IIRC it was frequency dependent and shorter than that. Audiolense, for example, uses a fixed-cycle-count window for its in-room measurement algorithm for this reason, and REW and MLSSA support similar windows.
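The fixed-cycle-count idea is simple to express, for anyone following along. The cycle count here (6) is just a placeholder of mine, not Audiolense's actual value:

```python
FS = 48_000          # sample rate (Hz) -- assumed
N_CYCLES = 6         # cycles per window -- placeholder, not a known value

def window_samples(freq_hz, n_cycles=N_CYCLES, fs=FS):
    """Window length in samples for a fixed number of cycles at freq_hz."""
    return round(n_cycles / freq_hz * fs)

# The window shrinks as frequency rises: long at bass, short at treble.
for f in (100, 1_000, 10_000):
    print(f, window_samples(f))
```

The point being that a frequency-proportional window gives constant resolution in cycles, where a fixed 400 ms window averages ~40 cycles at 100 Hz but ~4000 at 10 kHz.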
PEAQ is interesting... Thanks for pointing it out!
You're welcome, hope it leads to something. My understanding of PEAQ is that its perceptual model is much more complicated than what we're talking about here.
I had one other idea to share. Way back when, I wrote a test algorithm similar to what you're doing, but with the goal of uncovering time-variant processes such as DSP signal-dependent gain, compression, etc. The basic idea was to take any music stimulus you wanted, apply a steep notch filter, and then embed a tone in the notch whose amplitude was modulated by the signal envelope, so as to minimize disruption to the programme. You'd then run it through the DUT and compare the output to the stimulus through a steep band-pass filter, looking only at the tone. It uncovered a multitude of interesting signal-dependent nonlinearities. I used it to reverse engineer the nonlinear signal processing of a third-party device and presented the results to the company. They threatened to sue, thinking I'd hacked their firmware. So, I'd say it worked.

It's different from what you're doing with pk-error, in that it's not a perceptual metric, but it's a very powerful addition to the toolbox for seeing a bit more directly what real nonlinear processes are occurring in real time with any signal. I presented it to the IEEE and they were going to pursue it, but then I left the audio industry and I don't think it was picked up after I left.