I am rethinking what I want to test, and I think ABX testing is not what I want to do.
I want to know the threshold of audibility of the group delay response (and of any other effects, like frequency response deviations) for a variety of crossover filters. Instead of paired comparisons, I can put forward several choices that the listener picks from or ranks. For example, to probe the purported audibility threshold I can present N versions of a short audio track: the original plus N-1 versions with increasing amounts of group delay. The subject can play each version as many times as they like, and is asked to check ALL options that sound like unaltered audio, leaving any that sound altered unchecked. The options are listed in random order, but I will of course know which is which.
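To make the protocol concrete, here is a minimal sketch of how one trial could be assembled and scored. The file names, the number of levels, and the helper functions are all my own assumptions for illustration, not part of any existing test harness:

```python
import random

# Hypothetical stimulus list: index 0 is the original, 1..N-1 are
# versions with increasing group delay (file names are assumptions).
STIMULI = [f"track_gd_level_{i}.wav" for i in range(11)]  # 0 = unaltered

def make_trial(seed=None):
    """Return the randomized presentation order (the answer key)."""
    rng = random.Random(seed)
    order = list(range(len(STIMULI)))
    rng.shuffle(order)
    # The subject sees options in this shuffled order; we keep the key.
    return order

def score_trial(order, checked_positions):
    """Map the subject's checked positions back to group-delay levels.

    checked_positions: 0-based indices of the options the subject
    marked as 'sounds unaltered'.  Returns the set of GD levels the
    subject accepted as unaltered.
    """
    return {order[p] for p in checked_positions}

# Example: the subject checks the 2nd, 5th, and 6th presented options.
order = make_trial(seed=42)
accepted_levels = score_trial(order, [1, 4, 5])
print("Levels judged unaltered:", sorted(accepted_levels))
```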
After many trials by various subjects I should obtain a sort of histogram representing the cumulative voting on the limit of audibility for the crossovers that were auditioned, and for the particular audio selection. It might look something like this: 0 = unaltered audio, 10 = audio with audible effects from group delay, 1 to 9 = increasing levels of group delay, and the y-axis is the fraction of the test subjects who selected each audio track 0..10 as sounding "unaltered".
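Purely to illustrate the shape I have in mind (the vote fractions below are invented for the sketch, not measured data):

```python
import matplotlib.pyplot as plt

# Invented, illustrative vote fractions: high acceptance at low GD
# levels, falling off as the group delay becomes audible.
levels = list(range(11))           # 0 = unaltered ... 10 = most GD
frac_unaltered = [0.95, 0.93, 0.94, 0.90, 0.85,
                  0.70, 0.50, 0.30, 0.15, 0.08, 0.05]

plt.bar(levels, frac_unaltered)
plt.xlabel("Group delay level (0 = unaltered)")
plt.ylabel('Fraction of subjects voting "unaltered"')
plt.title("Hypothetical audibility-vote histogram")
plt.ylim(0, 1)
plt.show()
```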
This is, in a way, like having the subject perform N AB tests in parallel.
The next, and most important, question is how to analyze the histogram data statistically. It seems to me that the histogram I obtain might resemble a cumulative binomial probability distribution. If that sort of descriptive statistic applies, I can convert the curve to its underlying binomial form, find the mean, and calculate confidence limits on the mean (or similar statistics). The group delay corresponding to the mean would then be reported as the GD audibility threshold for the particular audio track used in the test. A more direct statistical approach: given the number of times the test was taken, I can calculate what vote fraction would be required for 95% confidence and then see where the cumulative distribution crosses that value. That crossing would be the audibility limit at 95% confidence.
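As a minimal sketch of that second approach, assuming the votes at each level are independent Bernoulli trials, one could put a 95% Wilson score interval around each level's vote fraction and flag the levels whose interval no longer overlaps the unaltered track's. The vote counts and the non-overlap criterion here are my own assumptions:

```python
import math

def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion k/n."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# Invented example counts: 'unaltered' votes per GD level, 40 subjects.
n_subjects = 40
votes = [38, 37, 38, 36, 34, 28, 20, 12, 6, 3, 2]  # levels 0..10

# Criterion: a level counts as audibly altered once the upper bound of
# its interval falls below the lower bound for the unaltered track.
lo0, _ = wilson_interval(votes[0], n_subjects)
for level, k in enumerate(votes):
    lo, hi = wilson_interval(k, n_subjects)
    flagged = hi < lo0
    print(f"level {level:2d}: p={k/n_subjects:.2f}  "
          f"95% CI=({lo:.2f}, {hi:.2f})  audible? {flagged}")
```

Non-overlapping intervals are a conservative stand-in for a formal two-proportion test; fitting a psychometric function (e.g., a logistic curve) to the same fractions would give a smoother threshold estimate from the same data.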
This could be repeated for several audio tracks, since different kinds of program material supposedly influence the audibility threshold. I would create one web page per test and could then easily add new tests (each with a new and different audio track) at any time. The subject gets no immediate feedback from a trial, such as a score, but that might also discourage people from going back and trying to improve their score.
What I like about this approach is that the subject controls how many times they listen to the audio before making their selections.
Please share your thoughts on this approach and my assumptions about the statistics! Thanks