Because I plan to do a few more blind listening comparisons, I wanted to ask here for advice on methodology to maximize the generalizability of such results. You can read the results of my first blind test (where I used the methodology described below) here, where I recently compared the KEF R3 vs. the Ascend Sierra 2EX.
However, I would greatly appreciate as much feedback and as many suggestions as possible for improving the methodology! I want to make sure as much as possible is covered. And, if this thread goes well enough, maybe we can take the combined ideas here and turn them into a loose, informal 'standard' for in-home blind speaker comparisons.
For any blind listening test, the results are only as good as the control tests: the negative and positive controls.
The Negative Control asks what fraction of listeners report "I heard no difference" when there really was no difference, in a test of speaker A vs. speaker A (or B vs. B). Stated conversely, the negative control measures how many false positive responses listeners make. Don't assume there will be no false positives when you can measure how many there are. A blind listening test is a test of human perception, and it isn't an unfair trick to find out how many false positive answers are given. Negative controls are very easy to run, since no extra technical effort is needed, but they do add to the number of blind trials listeners must sit through.
Each listener must sit through several such trials to determine what percentage of his answers are false positives. For example, if a listener claimed to hear differences between A and B on 75% of his trials, but his false positive rate was 45%, the corrected response rate would be 75% – 45% = 30%. (These are simply spitballed numbers, not the results of real tests.)
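The correction above is just a subtraction, but a few lines of Python make the bookkeeping concrete (a minimal sketch; the function name and the clamp at zero are my own additions):

```python
def corrected_response_rate(reported_rate, false_positive_rate):
    """Subtract a listener's false-positive rate (measured on A-vs-A
    or B-vs-B trials) from the rate at which he reported differences
    on A-vs-B trials, clamping at zero so a very trigger-happy
    listener doesn't produce a negative rate."""
    return max(0.0, reported_rate - false_positive_rate)

# Using the spitballed numbers from the text:
rate = corrected_response_rate(0.75, 0.45)
print(f"{rate:.0%}")  # 30%
```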
Similarly, Positive Control tests are needed. These are designed to estimate how many negative responses are false negatives. A positive control asks what fraction of listeners report "I did hear a difference" when one really existed, or in other words, how many false negatives there were.
Good positive control tests are harder to create than negative controls. Imagine a digital copy of well-recorded music. On top of it, add varying amounts of digital white or pink noise, so the listener hears the music with no added noise (0%) as well as a series of increasing noise levels (such as 2.5%, 5%, 7.5%, and 10%). What fraction of listeners hear the added noise at each of those levels? Without some testing, I'm not sure what would make a useful positive control, but I hope this illustrates what I mean.
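A stimulus set like that could be generated along these lines (a hypothetical sketch in pure Python; I'm defining "x% noise" as Gaussian white noise scaled to x% of the signal's RMS level, since the text leaves the exact definition open):

```python
import math
import random

def add_white_noise(samples, noise_fraction, seed=0):
    """Return a copy of the signal with Gaussian white noise mixed in,
    scaled to noise_fraction of the signal's RMS level (one plausible
    definition of 'x% added noise')."""
    rng = random.Random(seed)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return [s + rng.gauss(0.0, noise_fraction * rms) for s in samples]

# One second of a 1 kHz test tone at an 8 kHz sample rate,
# rendered at each of the control levels from the text:
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(8000)]
stimuli = {pct: add_white_noise(tone, pct / 100)
           for pct in (0, 2.5, 5, 7.5, 10)}
```

In a real test you'd apply the same mixing to the music tracks themselves and export each level as an audio file, but the math is the same.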
The positive control would reveal what subtle differences listeners actually can hear. It would also serve as an internal measure of how good the entire measurement system is, including the electronic gear, speakers, room acoustics, and variability among different listeners, as well as variable responses from a single listener over time due to burn-out, fatigue, inattention, etc. If people repeat the positive control test at different times using different gear or listeners, it can work as an internal standard that allows more meaningful comparisons among those different tests.
If suitable positive control tests are found, listening results can be judged further. How many listeners hear a difference on the positive control, and how many fail to, is an important measure of the effectiveness of the listening-test setup and of the variability among individual listeners. The fraction of listeners who meet these conditions might be taken as a measure of validity for the whole listening test. Ideally, every listener would hear a difference in the positive control and none would hear a difference in the A–A or B–B tests. However, it is possible to deviate from that ideal and still draw useful conclusions, as long as suitable controls are included for each listener to determine his false negative and false positive rates.