Well, if the problem is not having a proper "level matched" setup, then the tester does one test as "level matched" as possible. After that, the device that did worse (by more than a marginal amount) can be set slightly louder or quieter (maybe the preference is for quieter instead, so try it both ways). That way the worse one gets its chance, and the result would rebut level differences as an explanation for the first test: it would show that a real difference was heard in the first test. <- All of these tests have to be blind for the tester.
Just saying, there are ways around the "properly level matched" argument for regular people with just a mic (not perfect, but close enough).
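For the "just a mic" part, here is a rough sketch of how the level difference between two devices could be estimated from recordings of the same test tone. Everything in it is an assumption for illustration: the function names are made up, the "recordings" are synthetic sine waves standing in for real mic captures, and it assumes the samples are already loaded as floats and that the mic did not move between recordings.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_difference_db(samples_a, samples_b):
    """Level of recording A relative to recording B, in dB."""
    return 20 * math.log10(rms(samples_a) / rms(samples_b))

# Synthetic stand-ins for two mic recordings of the same 1 kHz tone
# (hypothetical data, not real measurements).
rate = 48000
tone = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(rate)]
device_a = tone
device_b = [0.9 * s for s in tone]  # device B plays slightly quieter

diff = level_difference_db(device_a, device_b)
print(f"A is {diff:+.2f} dB relative to B")
```

The idea would be to adjust one device's volume until the printed difference is as close to 0 dB as the setup allows, then run the blind test.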
[EDIT] Just realized, the louder and quieter tests cannot both be included in the assessment (because one of them would favor the winner), so the worse performing of the two would be the control, to evaluate whether one still prefers the winner of the "level matched" test. It is doable, it just takes three tests instead of one, and blind of course.
[EDIT2] Just realized, the argument isn't about which one is better at all (that's subjective). The question should be: is there a difference? So in the "level matched" setup, the tester should successfully distinguish the two devices in 7/10 trials, then do the louder/quieter tests and pick the worse performing one as the control. If that worst performing test still got 7/10, it says the tester was able to distinguish the two devices and they did sound different in the "level matched" setup. Just food for thought, and testing can be fun.
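To put a number on the 7/10 idea, here is a small sketch of the chance of hitting a given score by pure guessing, plus the "worst test is the control" rule. The function names and the example scores are made up for illustration, and it assumes each trial is an independent 50/50 guess when no difference is audible.

```python
from math import comb

def p_at_least(hits, trials, p_guess=0.5):
    """Probability of getting at least `hits` correct out of `trials` by guessing."""
    return sum(
        comb(trials, k) * p_guess**k * (1 - p_guess) ** (trials - k)
        for k in range(hits, trials + 1)
    )

def difference_heard(scores, needed=7):
    """True if even the worst performing test reached the required score."""
    return min(scores.values()) >= needed

# Chance of reaching each score out of 10 by guessing alone
for hits in (7, 8, 9):
    print(f"{hits}/10 or better by guessing: {p_at_least(hits, 10):.3f}")

# Hypothetical scores out of 10, not real results
scores = {"matched": 8, "louder": 7, "quieter": 9}
print("difference heard:", difference_heard(scores))
```

Under those assumptions, 7/10 or better comes up by guessing about 17% of the time, so anyone who wants a stronger result could raise the bar (8/10 or 9/10) or run more trials.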