When I ran a double-blind test to see whether listeners could detect the presence (vs. the absence) of an additional ADC/DAC double conversion in the signal path, about 8 people were involved: two to run the mechanics of the test and 6 experienced listeners (or maybe 5; it was years ago). After playing with the setup to find the best parameters for the experiment, we settled on the following (not really what I would do if I were comparing speakers, though!):
-- The stimulus had to be 10 to 20 seconds long. Any shorter and there is not enough information; any longer and you cannot remember it.
-- Replay the same chunk for both versions, rather than switching from one version to the other while the track continues to play.
-- It was a detection test: each trial had two stimuli, which could be w-w, w-wo, wo-w or wo-wo (with or without the extra conversion), presented in random order, and the subjects had to say whether the two stimuli were the same or different.
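For anyone who wants to try something similar, the trial schedule for that kind of same/different test could be sketched roughly like this. It is only an illustration, not the procedure we actually used: the balanced design (each pair type used equally often), the number of repeats, the seed, and the function names are all my assumptions here.

```python
import random

# The four pair types from the list above: "w" = with the extra ADC/DAC
# conversion in the path, "wo" = without it.
PAIR_TYPES = [("w", "w"), ("w", "wo"), ("wo", "w"), ("wo", "wo")]


def make_schedule(repeats_per_type, seed=None):
    """Return a shuffled list of stimulus pairs, each type used equally often.

    Balancing the pair types is an assumption of this sketch; it makes half
    of the trials "same" and half "different".
    """
    rng = random.Random(seed)
    schedule = PAIR_TYPES * repeats_per_type
    rng.shuffle(schedule)
    return schedule


def correct_answer(pair):
    """The answer a listener should give for one pair."""
    return "same" if pair[0] == pair[1] else "different"


# Hypothetical session: 4 repeats of each pair type, 16 trials in total.
schedule = make_schedule(repeats_per_type=4, seed=1)
answer_key = [correct_answer(pair) for pair in schedule]
```

The answer key stays with the people running the test, and the listeners only ever report "same" or "different" for each trial.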
Ultimately the conclusion was that nobody could tell the difference. Never mind preference.
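One way to put a number on "could not tell the difference" is to check whether a listener's score beats pure guessing. The sketch below is a generic exact binomial check, not the scoring we used back then; the 50% chance level assumes half the trials are "same" and half "different", and the scores in the example are made up for illustration.

```python
from math import comb


def p_value_at_least(correct, trials, chance=0.5):
    """One-sided exact binomial p-value: the probability of getting at least
    `correct` answers right out of `trials` by guessing alone."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical scores, NOT results from the test described above.
p_near_chance = p_value_at_least(9, 16)   # barely better than a coin flip
p_strong = p_value_at_least(15, 16)       # almost every trial correct
```

A large p-value means the score is what you would expect from guessing, which is the situation we ended up in.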
Speakers differ from one another much more, and are easier to tell apart. Still, what you are proposing is so subjective, and so tied to one very specific setting, that unless you can do something to remove (as much as possible) the influence of the environment, it would be virtually worthless to me. It would be much more useful to the owner of the test location.
I guess that if you applied the same Dirac target curve to all of the speakers, then the comparison would be much more valuable to a general audience. These kinds of subjective comparisons have only peripherally to do with fidelity, which is what reduces the value of any conclusion for any single individual.