Because I plan to do a few more blind listening comparisons, I wanted to ask here for advice on methodology to maximize the generalizability of such results. You can read the results of my first blind test (where I used the methodology described below) here, where I recently compared the KEF R3 vs. the Ascend Sierra 2EX.
However, I would greatly appreciate as much feedback and as many suggestions as possible for improving the methodology! I want to make sure as much as possible is covered. And, if this thread goes well enough, maybe we can take the combined ideas here and turn them into a sort of loose, informal 'standard' for in-home blind speaker comparisons.
Below I have listed in detail the methodology I used for that test.
Setup:
- Place both speakers as close to each other as possible, spaced equally from the wall behind them.
- Level-match both speakers precisely, to prevent one speaker from "winning" simply because it plays louder than the other.
- Integrate a subwoofer crossed over at 100 Hz to factor out differences in bass extension, since Dr. Toole's research indicates that roughly 30% of speaker preference is determined by bass extension capability alone. It would be interesting to get other thoughts here, though.
- Compare one song at a time, moving on to the next song only after the listener's conclusions for the current song have been recorded.
- For each new song, the first speaker played is called "Speaker A" and the second "Speaker B" from the listener's perspective, but the actual assignment of the two speakers to those labels is randomized for each song. This prevents a cumulative bias from developing in the listener that would void the statistical independence of each song test.
- When comparing a single song, a sub-interval (usually ~30 seconds) of the song is played on "Speaker A" and then on "Speaker B". In my past test I allowed the listener to request a replay of the same segment if they wished. I then proceed to the next ~30 seconds or so, again played on each speaker. The test ends when the listener indicates they are done (either because they have formed a confident preference, or because they have decided they have none).
- When switching speakers, I wait at least ~10 seconds before resuming playback. This is meant to prevent the switch from producing an obvious, audible shift in the apparent source location, since the two speakers sit in slightly different positions (side by side).
- Once the listener has completed the evaluation of both speakers, the following questions are asked:
- Which speaker (if any) did you prefer, for this song?
- Can you explain the differences you heard between Speaker A and Speaker B?
- [When clarification is necessary due to use of ambiguous terms:] Can you explain what you mean by [word/phrase]?
- The last of these three questions is never asked more than three times.
- This is meant to reduce any possibility of expectation bias introduced by the test administrator through varying the number of questions asked (e.g., asking more questions to subconsciously try to invert the results). Beyond this single degree of freedom (whether to ask the clarification question 0, 1, 2, or 3 times), I cannot think of any other administrator influence that would be removed by making the test double-blind (double-blind would be nice, but is often impractical with limited resources).
- This 'interview' portion is transcribed or otherwise recorded, to be compiled into the test results.
- Of course, the more participants in any such test, the better. Though Dr. Toole's research shows that most humans appear to share the same speaker preferences, there is always a chance that with only N=1 (where N is the number of listeners) you'll get unlucky and have a listener whose preferences deviate significantly from the mean, even if the probability of that is low. Please correct me if I'm wrong, though.
- Room choice? I'm not sure, but it seems an "average" or "good" room should be preferred: not necessarily one with extensive, expensive treatments, and not one with severe echo/reflection problems. Other than that, I don't think tests need to be replicated in more than one room for the results to be valid, per Dr. Toole's research. Please correct me if I'm wrong, though.
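To make the level-matching step concrete, here's a minimal Python sketch of the underlying gain math. The function name and the example SPL readings are my own for illustration; the only fact used is that a difference of X dB corresponds to an amplitude (voltage) gain of 10^(X/20):

```python
def gain_for_level_match(spl_a_db: float, spl_b_db: float) -> float:
    """Return the linear gain to apply to speaker B so it matches speaker A.

    A level difference of X dB corresponds to an amplitude gain of 10**(X/20).
    """
    diff_db = spl_a_db - spl_b_db
    return 10 ** (diff_db / 20)

# Example: speaker B measures 2 dB hotter than speaker A at the listening
# position, so B needs to be attenuated by a factor of ~0.794.
print(round(gain_for_level_match(83.0, 85.0), 3))
```

In practice you'd take the SPL readings with pink noise and an SPL meter at the listening position, then dial the resulting gain into the amp or DSP.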
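The per-song randomization of the A/B labels can be sketched as follows. This is just one possible implementation; the function and parameter names are mine, and the speaker names are the two models from my first test:

```python
import random

def assign_labels(songs, speakers=("KEF R3", "Ascend Sierra 2EX"), seed=None):
    """For each song, randomly decide which physical speaker is 'A' and which is 'B'.

    Randomizing the assignment per song keeps each trial statistically
    independent: the listener cannot carry a learned A/B identity from
    one song over to the next.
    """
    rng = random.Random(seed)
    assignments = {}
    for song in songs:
        first, second = rng.sample(speakers, 2)
        assignments[song] = {"A": first, "B": second}
    return assignments

schedule = assign_labels(["Song 1", "Song 2", "Song 3"], seed=42)
for song, mapping in schedule.items():
    print(song, mapping)
```

Passing a seed makes the schedule reproducible, which is handy if you want to document the assignments in the write-up after the fact.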
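The per-song playback protocol (~30 s segment on A, a >=10 s pause to mask the positional shift, the same segment on B, then the listener's choice to replay, advance, or finish) can be sketched like this. The code only builds the playback schedule; actual playback and the listener prompt are outside its scope, and all names are mine:

```python
def run_song_trial(decisions, segment_length=30, switch_gap=10):
    """Build the playback schedule for one song's A/B comparison.

    Each round plays ~segment_length seconds on Speaker A, pauses
    switch_gap seconds (to mask the positional shift between the two
    side-by-side speakers), then plays the same span on Speaker B.
    `decisions` holds the listener's response after each A/B pair:
    'r' = replay the same segment, 'n' = next segment, 'd' = done.
    """
    schedule = []
    start = 0
    for choice in decisions:
        schedule.append(f"A: {start}-{start + segment_length}s")
        schedule.append(f"pause {switch_gap}s")
        schedule.append(f"B: {start}-{start + segment_length}s")
        if choice == "d":
            break
        if choice == "n":
            start += segment_length

    return schedule

for step in run_song_trial(["n", "r", "d"]):
    print(step)
```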
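On the statistics side, even with one listener you can ask whether a preference across multiple songs is distinguishable from coin-flipping. A simple way to do that is an exact two-sided sign (binomial) test; this is my suggestion for analyzing the results, not something from the original protocol:

```python
from math import comb

def sign_test_p_value(wins: int, trials: int) -> float:
    """Exact two-sided sign test for a speaker preference.

    Under the null hypothesis of no preference, each song is a fair coin
    flip. Returns the probability of seeing a split at least this extreme.
    """
    k = max(wins, trials - wins)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)

# Example: one speaker preferred on 8 of 10 songs -> p ~ 0.109,
# i.e. not yet strong evidence of a real preference.
print(round(sign_test_p_value(8, 10), 3))
```

This treats each song as an independent trial, which is exactly what the per-song label randomization above is meant to justify.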