
Relevance of Blind Testing

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,784
Likes
37,678
I used up/down testing for some threshold studies, and finding the "point of failure" is a non sequitur. How did you mean it?
Finding the point at which you fail to discern a difference. So the test is over when you reach a certain number of misses.
 

pkane

Master Contributor
Forum Donor
Joined
Aug 18, 2017
Messages
5,712
Likes
10,410
Location
North-East
Using failure as a stopping criterion is a form of significance-chasing or p-chasing, which has been the topic of several recent papers. It is important to set the number of trials before starting, unless using an established adaptive method with accepted stopping criteria (e.g. up/down, PEST, Quest or Psi), which is not applicable to ABX.
There can be several reasons to take a fatigue break, but basing the break on responses (results) is not allowed. We use:
-subject requests a fatigue break, and
-a forced short break after a specific time (15-30 min, depending on task)
-a forced break of a day or more after a specific amount of time or block of trials (1-2 hrs, depending on task)

Correct me if I'm wrong, but the subject knowing the result immediately after each trial can also influence the final outcome (for example, by making the test taker anxious due to multiple failures). Better to keep the outcomes of the individual trials hidden until the end of the session, so as to not introduce yet another uncontrolled variable, no?
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,489
Likes
4,114
Location
Pacific Northwest
I could see that going either way: some people might want to see the results of each trial after they select, others might not want to see them until the end of the test. The test is valid either way. I see it as a personal choice. The goal is for the listener to feel as relaxed and comfortable as possible. In my ABX testing app, I made this a user-selectable option.

PS: I'll add that you can certainly sum or aggregate individual tests, so long as each is correctly performed. In fact, this is the only way to gain (or disprove!) confidence from low raw scores. For example, 5 of 7 is only 77% confident (you have a 100 - 77 = 23% chance to do that well by guessing). But if you get 5 of 7 correct in 3 separate tests, that's equivalent to 15 of 21, which is 96% confident. Literally this could be called "long-term listening" but I hesitate to use that phrase because it is already used too often to mean something different.

However, if you always end the test on a failed trial, the test is invalid, so the above aggregation isn't valid either.
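The figures in that PS can be reproduced with a few lines of Python (a quick sketch using only the standard library; `math.comb` requires Python 3.8+):

```python
from math import comb

def confidence(correct: int, trials: int) -> float:
    """Confidence that the listener wasn't guessing: 1 minus the
    probability of scoring `correct` or better on fair coin flips."""
    p_guess = sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials
    return 1 - p_guess

print(f"5 of 7:   {confidence(5, 7):.1%}")    # 77.3% (22.7% chance by guessing)
print(f"15 of 21: {confidence(15, 21):.1%}")  # 96.1% (3.9% chance by guessing)
```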
 

krabapple

Major Contributor
Forum Donor
Joined
Apr 15, 2016
Messages
3,197
Likes
3,768
It's possible to conduct a single experiment where you terminate the test based on the first failure.
The probability of 5 correct guesses followed by 1 wrong guess is (0.5)^5 * (0.5) = 0.0156. That might be considered statistically significant in itself. Let me repeat: this assumes you do a single sequence of guesses only. It's your first and only experiment.

We run into problems when you conduct multiple series of tests. You can't add together the numbers for the different tests if you've used failure as a stopping criterion. Intuitively: because you've limited the number of failures per test to 1, adding these 1s together doesn't mean anything.
But: you're also not allowed to do multiple series of tests and then calculate probabilities using just 1 of them. That would be cherry-picking your data.

Conclusion: combining the probabilities for the tests that were stopped using the first failure as the criterion is more complicated than just adding the numbers. You're not allowed to calculate using just one of the tests.

All of this assumes you recorded all your test series. You're never allowed to discard data after looking at the results. Judging from the loose description I doubt that was the case here.
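The cherry-picking hazard is easy to demonstrate with a rough simulation sketch (illustrative numbers, not from any actual test): a pure guesser who stops at the first miss, and is free to try again, will sooner or later produce a streak that looks significant in isolation.

```python
import random

random.seed(1)  # reproducible illustration

def streak_before_failure() -> int:
    """Correct guesses a coin-flipper makes before the first miss."""
    n = 0
    while random.random() < 0.5:
        n += 1
    return n

# A single stop-on-failure run reaches 5+ correct with probability
# 0.5**5 ~ 3.1%, which looks "significant" on its own.
runs = [streak_before_failure() for _ in range(100_000)]
print(f"single run with 5+ correct: {sum(s >= 5 for s in runs) / len(runs):.1%}")

# But a guesser allowed 50 attempts almost certainly gets such a streak once:
print(f"best of 50 attempts:        {1 - (1 - 0.5**5) ** 50:.1%}")  # ~80%
```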


Pio2001 (who also posts here) wrote a few nice best practices ABX posts years ago on Hydrogenaudio that have been stickied there -- including the issues of running multiple tests, when to stop, expectations, etc.
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,489
Likes
4,114
Location
Pacific Northwest
Question for the stats folks: consider someone who ABX tests and gets 5 of 7. This is 77.3% confidence (22.7% chance to do that well by guessing). He does this in 3 separate/independent tests, on different days. We could aggregate this as 3*5 = 15 of 3*7 = 21 and compute 15 of 21 is 96.1% confidence (3.9% chance to do that well by guessing).

But we could tackle this differently: There are 3 different tests, each independent and having 23% chance to pass by guessing. If you pass all three, the probability you're guessing should be .23 * .23 * .23 = 1.2%. This would be 98.8% confident.

These numbers are different so they can't both be right. Which is correct?
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,784
Likes
37,678
Question for the stats folks: consider someone who ABX tests and gets 5 of 7. This is 77.3% confidence (22.7% chance to do that well by guessing). He does this in 3 separate/independent tests, on different days. We could aggregate this as 3*5 = 15 of 3*7 = 21 and compute 15 of 21 is 96.1% confidence (3.9% chance to do that well by guessing).

But we could tackle this differently: There are 3 different tests, each independent and having 23% chance to pass by guessing. If you pass all three, the probability you're guessing should be .23 * .23 * .23 = 1.2%. This would be 98.8% confident.

These numbers are different so they can't both be right. Which is correct?
Probably don't remember enough statistics to give the most correct answer. I do believe the answer is both are correct.

5 of 7 three consecutive times has a different probability than 15 of 21 because there is more than one set of results that can become 15 of 21. For instance, it could be 7 of 7, 7 of 7, and 1 of 7. Or it could be 2 of 7, 6 of 7, and 7 of 7. As you can see, there are more ways to get 15 of 21 than there are ways to get 5 of 7 three times in a row.
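That counting argument checks out exactly: summing the orderings for every (a, b, c) split of 15 correct answers across three 7-trial tests recovers (21 choose 15) — the Vandermonde identity — and the 5-5-5 split is only a small fraction of it. A quick sketch:

```python
from math import comb

# Count the orderings contributed by each way of splitting 15 correct
# answers across three blocks of 7 trials.
total_ways = sum(
    comb(7, a) * comb(7, b) * comb(7, 15 - a - b)
    for a in range(8)
    for b in range(8)
    if 0 <= 15 - a - b <= 7
)

print(total_ways, comb(21, 15))  # both 54264
print(comb(7, 5) ** 3)           # 9261: the orderings from the 5-5-5 split
```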
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,489
Likes
4,114
Location
Pacific Northwest
But ultimately, only one answer can actually be correct. That is, if somebody passed 5 of 7 in each of 3 independent tests, how often does that happen when guessing, or conversely, how confident are we that he wasn't guessing?

Intuition tells me the second approach is correct: 23% ^ 3 = 1.2%, or 100 - 1.2 = 98.8% confident. But I can't prove that, which means I could be wrong.

It's also easily extensible. Suppose he scores 5 of 7 on the first test (that's 77% / 23%), then 6 of 7 on the 2nd test (that's 94% / 6%), then 4 of 7 on the 3rd test (that's 50/50). So the net confidence is 1 - (.23 * .06 * .5) = 1 - 0.7% = 99.3%.
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,784
Likes
37,678
But ultimately, only one answer can actually be correct. That is, if somebody passed 5 of 7 in each of 3 independent tests, how often does that happen when guessing, or conversely, how confident are we that he wasn't guessing?

Intuition tells me the second approach is correct: 23% ^ 3 = 1.2%, or 100 - 1.2 = 98.8% confident. But I can't prove that, which means I could be wrong.

It's also easily extensible. Suppose he scores 5 of 7 on the first test (that's 77% / 23%), then 6 of 7 on the 2nd test (that's 94% / 6%), then 4 of 7 on the 3rd test (that's 50/50). So the net confidence is 1 - (.23 * .06 * .5) = 1 - 0.7% = 99.3%.

I think your last estimate is correct. How often does 5 of 7 three consecutive times occur when guessing? Less often than other multiple combinations of three tests that get you to 15 of 21, though not necessarily the least common. Least common is probably 1 of 7, 7 of 7, and 7 of 7. In aggregate, if all you care about is the 15 of 21, then we don't care about how many ways you can get there. Only when you start breaking it down do you get these differences.

It also highlights the difference between a confidence value and the probability of a specific random result. All the confidence value tells you is how likely this result was from chance. If you add other conditions or look at specific patterns while eliminating others you'll have different probabilities, but you should not get misled by what that means. The odds of a random 15 of 21 is one probability. The odds of 15 of 21 broken into 3 identical subsets is a less likely result by random chance. But it also isn't the same base question about getting 15 of 21.

So for our purposes the lower confidence level one gets from 15 of 21 is the one we are interested in. Whether it happened as three 5 of 7 or not is not really important to us.
 

SIY

Grand Contributor
Technical Expert
Joined
Apr 6, 2018
Messages
10,511
Likes
25,352
Location
Alfred, NY
If the two experiments were exactly the same, no changes other than time, then you can combine the scores to arrive at the final confidence.
 

andreasmaaan

Master Contributor
Forum Donor
Joined
Jun 19, 2018
Messages
6,652
Likes
9,408
Question for the stats folks: consider someone who ABX tests and gets 5 of 7. This is 77.3% confidence (22.7% chance to do that well by guessing). He does this in 3 separate/independent tests, on different days. We could aggregate this as 3*5 = 15 of 3*7 = 21 and compute 15 of 21 is 96.1% confidence (3.9% chance to do that well by guessing).

But we could tackle this differently: There are 3 different tests, each independent and having 23% chance to pass by guessing. If you pass all three, the probability you're guessing should be .23 * .23 * .23 = 1.2%. This would be 98.8% confident.

These numbers are different so they can't both be right. Which is correct?

In my non-expert opinion, @Blumlein 88 was going in the right direction in thinking of it in terms of distributions.

96.1% is the only correct answer, and it's to do with the fact that this is a binomial distribution. Once each trial has been run, the number of ways of distributing correct responses changes. Since the p-value is dependent on the distribution, it keeps changing after each trial. You therefore can't take the p-value (or its inverse) after x trials and multiply it by the p-value after x+y trials.

I'm not a statistician so I can't explain it more elegantly (or formally) than that, but the clearest way I can think of to say it is that, although each trial is independent, the probability that the subject was guessing after a given number of trials is not.

At some level, this also makes intuitive sense. If Subject A does ten tests in a row getting 7/7 in each test, the chances they are guessing when they get 7/7 in their eleventh test are surely lower than for Subject B who got an average of 3.5/7 in their first ten tests and then 7/7 in their eleventh test.

Anyway, I have a statistician relative I can talk to, so I'm gonna ask him and try to come back with a more rigorous answer...
 

TSB

Active Member
Joined
Oct 13, 2020
Messages
189
Likes
294
Location
NL
Question for the stats folks: consider someone who ABX tests and gets 5 of 7. This is 77.3% confidence (22.7% chance to do that well by guessing). He does this in 3 separate/independent tests, on different days. We could aggregate this as 3*5 = 15 of 3*7 = 21 and compute 15 of 21 is 96.1% confidence (3.9% chance to do that well by guessing).

But we could tackle this differently: There are 3 different tests, each independent and having 23% chance to pass by guessing. If you pass all three, the probability you're guessing should be .23 * .23 * .23 = 1.2%. This would be 98.8% confident.

These numbers are different so they can't both be right. Which is correct?

The first way is right. The distribution is binomial: (21 choose 15) * (0.5 ^ 21) ≈ 2.6% for exactly 15 correct, and summing the tail for 15 or more gives ≈ 4%. Note (21 choose 15) takes into account the different orderings your 15 successes can have in the 21 trials.

The second way is weird: you're calculating the odds of having at least 15 correct results AND having at least 5 correct in each of the three experiments. This is more specific, hence less likely.

tldr: The probability of getting 15/21 spread over three days by chance is higher than just the probability of getting the specific ordering (5/7, 5/7, 5/7).
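Both events can be computed exactly with the standard library (a sketch; note that "at least 5 in each of the three tests" is a strict subset of "at least 15 in total", which is why its probability is smaller):

```python
from math import comb

def p_at_least(k: int, n: int) -> float:
    """Chance of k or more correct in n trials when purely guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

p_total = p_at_least(15, 21)    # any way of reaching 15+ of 21
p_each = p_at_least(5, 7) ** 3  # 5+ correct in every one of the three tests

print(f"P(15+ of 21 overall)  = {p_total:.1%}")  # 3.9%
print(f"P(5+ of 7, all three) = {p_each:.1%}")   # 1.2%
```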
 

SoundAndMotion

Active Member
Joined
Mar 23, 2016
Messages
144
Likes
111
Location
Germany
Finding the point at which you fail to discern a difference. So the test is over when you reach a certain number of misses.
That's an oversimplification of up/down methods and thresholds, at least as I know them, but I see what you meant. Thanks. Note that I mentioned up/down as one of the adaptive methods that doesn't start with a fixed number of trials, and that ABX is not an adaptive method.

Correct me if I'm wrong, but the subject knowing the result immediately after each trial can also influence the final outcome (for example, by making the test taker anxious due to multiple failures). Better to keep the outcomes of the individual trials hidden until the end of the session, so as to not introduce yet another uncontrolled variable, no?
Yes, although there are some psych experiments where giving feedback is part of the protocol, in psychophysics (incl. psychoacoustics) feedback can be a confound (as you suggest), and is normally avoided.
 

richard12511

Major Contributor
Forum Donor
Joined
Jan 23, 2020
Messages
4,337
Likes
6,708
The main problem of blind testing is that it proves every DAC and amplifier sounds the same. So we can eradicate this site and abandon our hobby.

The fact that all electronics more or less sound the same is the best thing I've ever learned in this hobby, as it allows me to budget more money for different loudspeakers (which do sound very different).

So no need to end the hobby just because electronics sound the same. Just shift the priorities of the hobby :)
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,489
Likes
4,114
Location
Pacific Northwest
I think your last estimate is correct. How often does 5 of 7 three consecutive times occur when guessing, less often than other multiple combinations of three tests that get you to 15 of 21. ... So for our purposes the lower confidence level one gets from 15 of 21 is the one we are interested in. Whether it happened as three 5 of 7 or not is not really important to us.
In my non-expert opinion, @Blumlein 88 was going in the right direction in thinking of it in terms of distributions.
96.1% is the only correct answer, and it's to do with the fact that this is a binomial distribution....
The first way is right. The distribution is binomial, (21 choose 15) * (0.5 ^ 7) ~= 4%. Note (21 choose 15) takes into account the different orderings your 15 successes can have in the 21 trials.
...
What bothers me about this approach is that it's not consistent with Bayes rule. Getting 5 of 7 correct is a test that one is 22.7% likely to pass by guessing. If you take the test twice and pass both times, these are independent events so Bayes rule says that is .227 * .227 = 5.2% likely. You guys are saying this is like getting 10 of 14 correct which is 9% likely.

If what you guys are saying is true, how do we explain or understand how it contradicts Bayes rule?
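Both numbers are straightforward to verify (a quick sketch; the exact product is 5.1% rather than 5.2%, the difference coming from rounding 22.7% to .227):

```python
from math import comb

def p_at_least(k: int, n: int) -> float:
    """Chance of k or more correct in n trials when purely guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

print(f"pass 5-of-7 twice: {p_at_least(5, 7) ** 2:.1%}")  # 5.1%
print(f"10+ of 14 overall: {p_at_least(10, 14):.1%}")     # 9.0%
```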
 

andreasmaaan

Master Contributor
Forum Donor
Joined
Jun 19, 2018
Messages
6,652
Likes
9,408
What bothers me about this approach is that it's not consistent with Bayes rule. Getting 5 of 7 correct is a test that one is 22.7% likely to pass by guessing. If you take the test twice and pass both times, these are independent events so Bayes rule says that is .227 * .227 = 5.2% likely. You guys are saying this is like getting 10 of 14 correct which is 9% likely.

If what you guys are saying is true, how do we explain or understand how it contradicts Bayes rule?

I think the best answer is the one @Timon VDB gave. There's no need to delve into conditional probability here.

However, the way I tried to explain it in my previous post was not through Bayesian probability. It's actually about each subject's inherent aptitude. In the example I gave, subject A has scored 70/70 and subject B has scored 35/70 in the first 10 "tests".

As you point out, it's not correct to say that subject A has a higher chance of getting 7/7 in the next test given that they got 7/7 in the last ten tests. Rather, we would say that subject A's chance of succeeding in any trial is (and always was) higher than subject B's.

(It's likely slightly more complex IRL ofc, because undertaking the test itself may constitute a form of training. But that's the general idea.)

Again, apologies to any statisticians out there for what is no doubt butchery of your field ;)
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,489
Likes
4,114
Location
Pacific Northwest
I think the best answer is the one @Timon VDB gave. There's no need to delve into conditional probability here. ...
We can't escape conditional probability because we are talking about the probability of 2 independent events. Therefore, while we have freedom to ignore Bayes rule and tackle the problem in other ways, whatever end result we compute must be consistent with Bayes rule.

Problem is, both approaches here seem intuitively correct. But they disagree, so at least one of them must be wrong. Bayes rule is simpler, so that is where I lean. I'll ponder this more today and see if I can figure it out definitively.
 

PierreV

Major Contributor
Forum Donor
Joined
Nov 6, 2018
Messages
1,449
Likes
4,818
What bothers me about this approach is that it's not consistent with Bayes rule. Getting 5 of 7 correct is a test that one is 22.7% likely to pass by guessing. If you take the test twice and pass both times, these are independent events so Bayes rule says that is .227 * .227 = 5.2% likely. You guys are saying this is like getting 10 of 14 correct which is 9% likely.

If what you guys are saying is true, how do we explain or understand how it contradicts Bayes rule?

Mixing p values and Bayes is very very tricky imho, especially on small sample sizes.

https://www.annualreviews.org/doi/full/10.1146/annurev-statistics-031017-100307#_i33
 

andreasmaaan

Master Contributor
Forum Donor
Joined
Jun 19, 2018
Messages
6,652
Likes
9,408
What bothers me about this approach is that it's not consistent with Bayes rule. Getting 5 of 7 correct is a test that one is 22.7% likely to pass by guessing. If you take the test twice and pass both times, these are independent events so Bayes rule says that is .227 * .227 = 5.2% likely. You guys are saying this is like getting 10 of 14 correct which is 9% likely.

Yes, but the probability of passing two tests by guessing is lower than the probability of passing one test (with twice the number of trials) by guessing.

That doesn't seem inconsistent to me.

Replace it maybe with a more natural example. You're shooting hoops and you define succeeding on Test A as shooting 5 out of 7 hoops, while succeeding on Test B is defined as shooting 10 out of 14 hoops.

You do test A twice, and score 3/7 the first time and 7/7 the second time. You've passed Test A once (out of two attempts). However, if you aggregate the results, you've passed Test B.

Your chances of passing Test B are higher than your chances of passing Test A twice.
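A Monte Carlo sketch of this hoops example (assuming a hypothetical 50% shooter, standard library only) shows the gap directly:

```python
import random

random.seed(42)  # reproducible illustration

def shots_made(n: int, p: float = 0.5) -> int:
    """Hoops made out of n attempts by a shooter with hit rate p."""
    return sum(random.random() < p for _ in range(n))

N = 200_000
passed_a_twice = passed_b = 0
for _ in range(N):
    first, second = shots_made(7), shots_made(7)
    passed_a_twice += first >= 5 and second >= 5  # Test A passed both times
    passed_b += first + second >= 10              # Test B: 10 of 14 overall

print(f"pass Test A twice: {passed_a_twice / N:.1%}")  # ~5%
print(f"pass Test B:       {passed_b / N:.1%}")        # ~9%
```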
 

TSB

Active Member
Joined
Oct 13, 2020
Messages
189
Likes
294
Location
NL
We can't escape conditional probability because we are talking about the probability of 2 independent events. Therefore, while we have freedom to ignore Bayes rule and tackle the problem in other ways, whatever end result we compute must be consistent with Bayes rule.

Problem is, both approaches here seem intuitively correct. But they disagree, so at least one of them must be wrong. Bayes rule is simpler, so that is where I lean. I'll ponder this more today and see if I can figure it out definitively.
The second approach correctly calculates the probability of the specific outcome of passing all three tests (at least 5/7 in each). That just doesn't have any meaning for the experiment. There is no contradiction.
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,489
Likes
4,114
Location
Pacific Northwest
Great example. It intuitively explains why 10 of 14 is easier to get through luck, than 5 of 7 twice. If you pass 5 of 7 twice, your aggregate score can't be lower than 10 of 14. But you can score 10 of 14 without passing 5 of 7 twice (you could get 4 of 7 then 6 of 7, or 3 of 7 then 7 of 7).

Yet this suggests that when computing the overall confidence of a series of tests, we should use Bayes rule, because we know the subject passed each of the shorter tests. He didn't fail some then make it up by doing better on others.
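That containment can be confirmed by brute force over all 2^14 equally likely guess sequences (a small sketch):

```python
from itertools import product

pass_both_halves = ten_of_fourteen = 0
for seq in product((0, 1), repeat=14):  # 1 = correct guess
    both = sum(seq[:7]) >= 5 and sum(seq[7:]) >= 5
    total = sum(seq) >= 10
    pass_both_halves += both
    ten_of_fourteen += total
    assert not both or total  # 5-of-7 twice always implies 10 of 14

print(f"{pass_both_halves / 2**14:.1%}")  # 5.1%: the smaller event...
print(f"{ten_of_fourteen / 2**14:.1%}")   # 9.0%: ...inside the larger one
```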
 