You say you want to test using a single speaker because it makes tonality more obvious, but then you say you can equalize to make one speaker sound like another or to correct problems. That would mean frequency response is trivial, given the many DSP technologies available. At the same time you don't talk at all about spatial response, which is the biggest variable among speakers and cannot be heard in single-speaker comparisons. Explain.

Members here are no strangers to the battle between trusting measurements and trusting listening tests as performed by reviewers at large. It is a difficult topic to address in text, so a while ago I decided to create a presentation and video for it. It was a harder job than I thought, but I finally managed to create a cohesive presentation based on research. I go through the formal research on how listening tests are performed and how their results correlate with measurements.
It is a long presentation, over an hour, but hopefully you will find it worthwhile to set aside that much time to watch it (or speed it up).
Research papers:
https://www.aes.org/e-lib/browse.cfm?elib=9822
A Survey Study of In-Situ Stereo and Multi-Channel Monitoring Conditions
https://www.aes.org/e-lib/browse.cfm?elib=12206
Differences in Performance and Preference of Trained versus Untrained Listeners in Loudspeaker Tests: A Case Study