Nice investigation!
I did a bunch of experiments on this a while back and the outcome was basically this:
The influence of group delay (and its equivalent phase response) on perception is multi-faceted.
There is a time-domain effect: excessive GD at low frequencies makes "the bass lag behind", e.g. smearing the "compactness" of bass drums and plucked upright string bass.
There is also a timbre shift for quasi steady-state notes when they contain an "even-order distortion" overtone profile with the proper phase, so that the waveform is highly asymmetrical. It turns out that we seem to be sensitive to the actual waveshape at low frequencies, and shifting the phases of the harmonics around through the additional phase contribution of crossovers and bass alignments impacts the perceived timbre. Flipping the absolute polarity can have a very similar effect on timbre, since it has a likewise apparent effect on the harmonics' phases.
A test signal for this can easily be constructed: take e.g. an 80 Hz fundamental, then add an equal-level 2nd harmonic de-tuned by 0.5 Hz, so 160.5 Hz.
This gives two distinct and stable lines in an FFT plot, but perceptually, besides the slightly out-of-tune harmonic, there is a slight cyclic change of timbre with a 2-second period, going from "fatter" to "leaner" and back. It is exactly linked to a realtime oscilloscope display of the waveform, where we can see the 2nd harmonic slowly "cycling" on top of the fundamental. The nice feature of this test is that it works with any speaker or headphones, no matter what their phase response is. Low distortion is required, though, so headphones at low signal levels are preferred for listening.
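For anyone who wants to try this, here is a minimal sketch of how such a de-tuned test signal could be generated with NumPy and the standard-library `wave` module. The function names, the 48 kHz sample rate, the 10 s length and the 0.25 peak level are my assumptions for illustration, not values from the experiments described above.

```python
import wave
import numpy as np

def detuned_h2_signal(f0=80.0, detune=0.5, seconds=10.0, fs=48000, peak=0.25):
    """Fundamental plus an equal-level 2nd harmonic de-tuned by `detune` Hz.

    All defaults except f0 and detune are assumed values for illustration.
    """
    t = np.arange(int(seconds * fs)) / fs
    x = np.sin(2 * np.pi * f0 * t) + np.sin(2 * np.pi * (2 * f0 + detune) * t)
    # Scale to a modest peak level to keep playback distortion low.
    return peak * x / np.max(np.abs(x))

def write_wav_mono16(path, x, fs=48000):
    """Write a float signal in the range [-1, 1] as 16-bit mono PCM."""
    pcm = np.round(x * 32767).astype("<i2")
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(fs)
        w.writeframes(pcm.tobytes())
```

Calling `write_wav_mono16("h2_detuned.wav", detuned_h2_signal())` would write the file. The relative phase between the two components advances by 360° every 1/0.5 Hz = 2 s, which is where the 2-second timbre cycle comes from.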
The de-tuning makes it a poor choice for an actual ABX. For that, a set of files is required, with exactly pitched H2 content aligned in 30° or 45° steps vs. the fundamental, through the whole 360° cycle. Depending on the total phase of the playback system, there will be one specific set of offsets spaced 180° apart that shows the strongest timbre shift when compared, and another 180° set, rotated by 90° against the first, that shows almost no timbre change when switched. The "strong" set can then be used successfully in ABX.
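A rough sketch of how such a file set could be produced, again with NumPy and `wave`. File names, the 5 s length, the 50 ms fades and the level are assumptions of mine. Both components get a fixed amplitude rather than per-file peak normalization, so every file has the same RMS level and only the waveshape differs between them.

```python
import wave
import numpy as np

def h2_phase_tone(phase_deg, f0=80.0, seconds=5.0, fs=48000, amp=0.125, fade=0.05):
    """Fundamental plus exactly pitched, equal-level H2 at a given phase offset.

    amp is per component, so the peak never exceeds 2 * amp (assumed level).
    """
    t = np.arange(int(seconds * fs)) / fs
    phi = np.deg2rad(phase_deg)
    x = amp * (np.sin(2 * np.pi * f0 * t) + np.sin(2 * np.pi * 2 * f0 * t + phi))
    # Short raised-cosine fades to avoid clicks at the start and end.
    n = int(fade * fs)
    ramp = 0.5 - 0.5 * np.cos(np.pi * np.arange(n) / n)
    x[:n] *= ramp
    x[-n:] *= ramp[::-1]
    return x

def write_phase_set(step_deg=45, prefix="h2_phase", fs=48000):
    """Write one 16-bit mono WAV per phase step over the full 360 degrees."""
    paths = []
    for deg in range(0, 360, step_deg):
        pcm = np.round(h2_phase_tone(deg, fs=fs) * 32767).astype("<i2")
        path = f"{prefix}_{deg:03d}.wav"
        with wave.open(path, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(fs)
            w.writeframes(pcm.tobytes())
        paths.append(path)
    return paths
```

With the default 45° step, `write_phase_set()` writes eight files, `h2_phase_000.wav` to `h2_phase_315.wav`. For this two-component signal, inverting the polarity is mathematically the same as shifting the H2 offset by 180° (plus a half-period time shift), which fits the 180° spacing of the sets described above.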
A "semi-scientific" paper somewhat relevant in this context:
https://www.researchgate.net/public...akes_auditory_sense_a_new_paradigm_in_hearing