OK, I'll try to explain my issue with "preference tests". I will have to be a little long, so please bear with me.
I took a seminar in graduate school with the "inventor" of conjoint analysis. Paul Green was a systematic researcher who explored "mathematical" modeling of consumer choices. Some of his earlier work had to do with consumer choices in airplane travel: for example, consumers preferred on-time arrival over seat width. He could "measure" preferences and called them utilities. You can find out about the techniques rather easily.
During the seminar, he presented probably the first project of utility (preference) measurement for the pharma industry. He was comparing the different "attributes" of two types of compounds for hypertension. One was the beta blockers (still used); the other was a new class of compounds called ACE inhibitors. BBs were the most commonly used ($$$) at the time (unless you include diuretics, but that doesn't matter now). BBs had (have) undeniable and conclusive data showing that they reduced the chances of a heart attack or sudden death in people taking them. But one of the most "insidious" side effects was impotence. The incidence was not very high, but the notoriety of the side effect was the issue. BBs also had better BP control than the ACEi he was consumer-testing: the ACEi did NOT have data on CV mortality and had slightly less efficacious control of BP than BBs.
The testing was done with physicians and with patients. The model showed that both physicians and patients preferred ACEi over BBs because the "immediate" risk of impotence mattered much more than the "eventual" risk of a heart attack. The "negative utility" of impotence outweighed the "positive utility" of improved survival. If you look at it rationally or logically, this shouldn't be the case, but this is what the study of preference showed, and the actual marketing of the drugs demonstrated it as well. A majority of hypertensive patients are men over the age of 50. "If your dog won't hunt, you didn't care about being alive". ACE inhibitors and similar agents completely took over the treatment of HBP. BBs are now reserved for patients who have already had a heart attack. And we have a blue pill to deal with the hunting pooch.
So, in this case, personal preference, in repeated exercises across different groups, showed that "irrational" choices or preferences mattered more than "rational" ones.
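To make the trade-off concrete, here is a minimal sketch of an additive conjoint (part-worth) model. All the numbers are invented for illustration; they are not Green's actual data, and the attribute names are my own simplification:

```python
# Hypothetical part-worth utilities (illustrative numbers only, not Green's data).
# In an additive conjoint model, a profile's total utility is the sum of the
# part-worths of its attribute levels; the higher-scoring profile is preferred.
part_worths = {
    "CV mortality benefit": {"proven": 1.0, "unproven": 0.0},
    "BP control":           {"better": 0.5, "slightly worse": 0.3},
    "impotence risk":       {"present": -2.0, "absent": 0.0},
}

def total_utility(profile):
    """Sum the part-worths of a drug profile's attribute levels."""
    return sum(part_worths[attr][level] for attr, level in profile.items())

bb   = {"CV mortality benefit": "proven",
        "BP control": "better",
        "impotence risk": "present"}
acei = {"CV mortality benefit": "unproven",
        "BP control": "slightly worse",
        "impotence risk": "absent"}

# BB totals -0.5, ACEi totals 0.3: one large negative part-worth
# (impotence) outweighs two positives (survival data, better BP control).
print(total_utility(bb), total_utility(acei))
```

The point of the toy numbers is just the mechanism: a single emotionally loaded attribute can carry a part-worth large enough to flip the overall preference, exactly as the physicians and patients did.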
Many years later, I read Kahneman, who demonstrated that humans are not rational. I was never able to find out if Paul Green knew about Kahneman's (and Tversky's) work. The overlap is uncanny. People like Ariely at Duke now do a lot of work based on the irrationality of people's decisions. I wonder if fancy cables conclusively demonstrate what Tversky used to say: that he studied people's stupidity...
It MAY be that Toole tested preference and not accuracy. Nothing wrong with that. But if they used a string quartet, for example, to test speaker preference, then bass response should not be part of the decision. If you incorporate music with bass, then they have demonstrated that bass will impact the overall choice or preference.
I would want to repeat Toole's tests with music that had no bass and then see if his preference ranking stands. Clearly, it can't. But then you would know what other factors drive people to prefer one speaker over another. In the "regression model" that is preference, the "utility" provided by bass will overwhelm almost all other parameters. At least it seems that way!
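That "regression model" intuition can be sketched too. The code below fits ordinary least squares to made-up attribute scores and preference ratings (none of this is Toole/Harman data; the attributes and numbers are my assumptions). It just shows what a dominant attribute looks like in the fitted weights:

```python
import numpy as np

# Toy preference regression (invented ratings, NOT Toole/Harman data).
# Rows are speakers; columns are attribute scores on a 0-1 scale:
# [bass extension, flat on-axis response, low distortion]
X = np.array([
    [0.9, 0.7, 0.8],
    [0.2, 0.9, 0.9],
    [0.8, 0.5, 0.6],
    [0.3, 0.8, 0.7],
    [0.7, 0.6, 0.5],
])
# Invented mean listener preference ratings, constructed here to track
# bass extension almost exclusively (as with a bass-heavy test program).
y = np.array([8.5, 5.0, 8.0, 5.5, 7.5])

X1 = np.column_stack([X, np.ones(len(X))])   # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# In this contrived data the fitted weight on bass extension dwarfs the
# weights on flatness and distortion.
print(coef)
```

Strip the bass out of the program material and the bass column stops explaining anything, so the fit would have to fall back on the remaining attributes: that is the experiment I would want to see.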
@b1daly shows exactly the problem here. Preference may not equate to accuracy. In "amplified" music, knowing how the end product was mastered is important. b1daly shows that if one listens to old masters on modern, accurate monitors, one may have a problem. What would happen if someone took the Beatles' original masters and remastered them using modern accurate monitors? Probably we would have a problem with them, as we would say that this is not how they "sounded"! So, for amplified music, having a standard monitoring curve seems perfectly reasonable to me. The consumer could then match their sound system to the "Harman" curve.
But for classical music, the Harman preference may not match the actual frequency distribution of the music in the playing space! I don't know if this is the case, but I am pretty sure that measuring audience preference at the experimental site without a "live music" control can create problems. It would be the "active control" of these experiments: you would not only be comparing Speaker A vs B vs C, but also vs "live music".
You can have a preference choice that is not accurate or rational.
Sorry for the digressions. Hope you can think it over.