Probably already said...
Let's deconstruct this a bit.
Blind testing means that the tester has no knowledge of which item is under test versus the control (or alternative) device or devices. Double-blind testing means that the person switching between those devices also has no knowledge of which item is selected. Blind testing is required, but double-blind testing ensures that the man behind the curtain isn't somehow revealing which is which (or even if there is or isn't a change in the device under test). Even the sound of a physical switch can be telling enough to violate that blindness, so testing technique is important and those conducting such tests spend a lot of time explaining and evaluating their technique.
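To make the mechanics concrete, here's a minimal sketch (in Python, and entirely my own illustration, not anyone's actual protocol) of the core logic of an ABX run. Letting the computer make and conceal the assignment is what takes the human switcher out of the loop; everything else about a real test--level matching, silent switching hardware, and so on--is still on the tester.

```python
import random

def run_abx_trials(num_trials=16):
    """Minimal ABX trial loop: on each trial, X is secretly A or B.

    The computer makes and conceals the assignment, so no human
    operator knows which device is playing--double-blind, assuming
    the switching itself is silent and gives nothing away.
    """
    correct = 0
    for trial in range(1, num_trials + 1):
        x = random.choice("AB")  # hidden assignment for this trial
        # A real test would silently switch to device x and play here.
        answer = input(f"Trial {trial}: is X device A or B? ").strip().upper()
        if answer == x:
            correct += 1
    return correct, num_trials

if __name__ == "__main__":
    right, total = run_abx_trials()
    print(f"{right} correct out of {total}")
```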
Once the test is appropriately blind, a separate process is used to determine if the results are statistically significant. They must be clear enough to reject the reasonable possibility that the tester is lucky in his guessing. That requires lots and lots of repetition--far more than his kids entering through a different door and moving wires would tolerate unless his kids are different from any I've ever seen.
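The lucky-guessing check, by the way, is just a one-sided binomial test: with n trials and k correct, how likely is getting k or more right from coin-flipping alone? A sketch using nothing but the standard library (the 16-trial numbers are mine, picked to show the arithmetic):

```python
from math import comb

def binomial_p_value(correct, trials):
    """One-sided p-value: the probability of getting `correct` or more
    answers right out of `trials` by pure 50/50 guessing."""
    return sum(comb(trials, i) for i in range(correct, trials + 1)) / 2 ** trials

print(binomial_p_value(12, 16))  # ~0.038, below the usual 0.05 threshold
print(binomial_p_value(10, 16))  # ~0.227, could easily be luck
```

That's why the repetition matters: 10 of 16 feels like success to the listener, but a coin gets there almost a quarter of the time.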
Blind, statistically significant testing is still subjective testing. I know "subjective" is a word we avoid, but the point of blind testing is not to make the test objective; it is to make it controlled. What makes it subjective is that it measures the user's response to the conditions under test. The correct term is "controlled subjective testing."
Objective testing, on the other hand, comprises measurements. Several things have to be true for measurements to be relevant:
1. The characteristic being measured has to relate to something that affects what we hear,
2. The measurement has to evaluate that characteristic with usable accuracy, and
3. The result of the measurement has to be of sufficient magnitude to affect the subjective response (a toy sketch follows this list). Endless arguments abound on this one point.
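Here's that toy illustration of point 3, with the caveat that the 1.0 dB figure below is just an assumed round number in the neighborhood of the often-cited just-noticeable difference for level--not a measured fact about any device:

```python
# Toy illustration only: a measured difference matters to point 3 only if
# it is large enough to change what we hear. The 1.0 dB threshold is an
# assumed round number for illustration, not a spec or a measurement.
ASSUMED_LEVEL_JND_DB = 1.0

def plausibly_audible(measured_delta_db, jnd_db=ASSUMED_LEVEL_JND_DB):
    return abs(measured_delta_db) >= jnd_db

print(plausibly_audible(0.2))  # False: real and measurable, but below threshold
print(plausibly_audible(2.5))  # True: large enough to plausibly matter
```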
We are also challenged by the relationship between objective and controlled subjective testing. For me, the basic philosophy is that we understand how measured effects affect subjective responses. Thus, if someone claims a subjective perception that violates that understanding, they carry a greater burden of proof on two important attributes of their claim. The first is that their perceptions are reliable, and for this we need controlled subjective testing as described above. (This is one of the thorny topics in audio. If expensive stuff is better, then it is better because its designer was either smarter or more persistent, but there must be some relationship between the design and the performance that can be explained. Without that explanation, there is no reason to believe that anything can be designed. The designer must decide what to use based on some understanding of how that change might result in an improvement; without that, there is no basis for believing that any one designer can create more than one superior product. Quality becomes a matter of luck. This is nonsensical on the face of it, so the purveyors of snake oil don't even claim it. They instead come up with bogus science to explain the effect they claim, in the apparent expectation that even experts will take that science at face value.)
The second is that the effects they perceive are important--that they form the basis for a preference of one device over the other. To evaluate importance, though, we have to agree on the outcome we want, and that will be a matter of familiarity and maybe taste, or, more importantly, of our listening objectives. Do we want it to sound like live music? Do we want it to be crystal clear in a noisy ambient environment? Do we want it to sound the way we remember it from our youth? Do we want it to sound pleasing to us at some level of feeling that defies further deconstruction, even if that isn't how the recording was made? Do we want it to be as faithful to the recording as possible? These objectives compete, and without understanding them, the importance of perceptions (even those that can be reliably detected) cannot be applied to others.

Much research has been aimed at modeling preferences as the empirical basis for determining the importance of a measurement. But there is a warning here, too. Preference models that are broadly based and statistically verified are good for manufacturers trying to hit the center of preference in the population, but they don't necessarily describe what you or I prefer. That takes us back to those objectives, whether they are stated or not. If our preferences depart from broad preference testing, we should be prepared to know and acknowledge that. Measurements and controlled subjective testing, however, can happily exist and be instructive apart from how closely they conform to preference models.
Finally, when we read reviews that claim to perform measurements and controlled subjective testing, we have to believe that the reviewers are actually doing what they say they are doing. In scientific research, test methods are clearly explained so that readers and reviewers of the work can have that confidence--it is never assumed on the basis of the researcher's reputation. And many who claim authority have demonstrated prior bias, which only increases the burden on them to show that their protocol is unbiased.
Here's the important point: The more the test results violate models that have been validated by measurements and controlled subjective testing, the greater the need for detail in describing the testing protocol. This is (or should be) common sense: If a witness in court is describing something that is inconsistent with what informed observers (i.e., the jury) accept as fact, the burden is put on the witness to back up their statements with more evidence. But even if testing verifies what is currently expected, it should be well enough described to eliminate or at least explain the biases inherent in it.
This ended up being a lot longer than I intended, but I think that demonstrates that it isn't quite as cut and dried as we often try to make it when we write "ABX!!!!!" Blind testing is not at all the whole story.
Rick "who reviews research reports for journals routinely" Denney