
Could someone help me to think through my ABX result using Bayesian reasoning?

sonitus mirus

Active Member
Joined
Apr 19, 2021
Messages
272
Likes
360
As I stated in the OP:

1. For many years, I'd been certain that I could hear clear differences between bit-identical playback means (e.g., different USB cables, streaming vs. local playback, different software players and their bit-identical settings, etc.) - something that has bugged me no end over the years


Mani.

These are extraordinary claims.
 

PierreV

Major Contributor
Forum Donor
Joined
Nov 6, 2018
Messages
1,449
Likes
4,816
If my prior was truly 0.5, the test would never have taken place. I can't state it more clearly.
Fair enough, but let me rephrase that again

If your expectations were high going into the test, stick to a pure frequentist approach, as others have suggested. You have a chance to prove your expectations. You can't bias a pure observational frequency test (provided a good testing protocol is in place), but you are definitely biasing a Bayesian test with a high prior. Any Bayesian "proof" will be weaker than a pure frequentist "proof".
 

Sgt. Ear Ache

Major Contributor
Joined
Jun 18, 2019
Messages
1,895
Likes
4,162
Location
Winnipeg Canada
If my prior was truly 0.5, the test would never have taken place. I can't state it more clearly.

What does this even mean?? If your prior was truly 0.5, nobody would even care if you took the test, because you wouldn't be going on audio forums claiming you could hear the difference between USB cables, right? You're saying the fact that you took the test proves that you could actually do that, because you wouldn't even have taken the test if you couldn't?
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,758
Likes
37,598
Could we get another thread dedicated to the idea of how Bayesian statistical thinking could be applied to audio testing. One starting from scratch and not involving an already done test. I know Mani's intent was something like this, but it has been polluted by other issues already.

I'd be interested in that if any of you here are well versed in using Bayesian statistics and testing. I sort of understand the idea, but it often seems impractical to use.

Bayesian statistics is a more complete theory of such things, and frequentist methods (what most people know, such as p-values) are a subset of that broader probability theory.

So are there any Bayesian experts on the forum who could help out?
 

Killingbeans

Major Contributor
Joined
Oct 23, 2018
Messages
4,096
Likes
7,572
Location
Bjerringbro, Denmark.
My prior was high, not because of "my preconceived ideas about myself", but because of my experience over many years.

That's the kind of reasoning that will get you caught in an endless loop of confusion.

The test is meant to establish whether or not your experience was caused by the playback system. Trying to use it as a simple verification of your experience will get you nowhere.

1. For many years, I'd been certain that I could hear clear differences between bit-identical playback means (e.g., different USB cables, streaming vs. local playback, different software players and their bit-identical settings, etc.) - something that has bugged me no end over the years

You keep underlining the word 'certain' as if it can legitimize an assumption. People who put magic crystals on top of their gear are also 100% certain that they make a huge difference to the sound. What makes their certainty less potent than yours?
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,484
Likes
4,110
Location
Pacific Northwest
Could we get another thread dedicated to the idea of how Bayesian statistical thinking could be applied to audio testing. One starting from scratch and not involving an already done test. ...
So are there any Bayesian experts on the forum who could help out?
I'm not an expert, just a practitioner. Some people here, especially the technical experts, have done statistical analyses that might benefit from (or already use) Bayesian methods; in other cases they may have had good reasons to avoid them. I think it would be an interesting and educational conversation and may produce useful take-aways.
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,758
Likes
37,598
I'm not an expert, just a practitioner. Some people here, especially the technical experts, have done statistical analyses that might benefit from (or already use) Bayesian methods; in other cases they may have had good reasons to avoid them. I think it would be an interesting and educational conversation and may produce useful take-aways.
Well, if we have even a few practitioners here, such a thread would be educational, I think. So is there anyone other than MRC01 who could usefully take part in such a thread?
 

Killingbeans

Major Contributor
Joined
Oct 23, 2018
Messages
4,096
Likes
7,572
Location
Bjerringbro, Denmark.
Which ESP-believer in his 'right mind' (obviously not!) would do this? They simply wouldn't, because the chances of them succeeding are so slim... unless they really had ESP, in which case they know their chances are high.

You forget that in their minds they are 100% convinced that they have ESP. They do not need to actually have ESP in order to have absolute confidence in their chances being high. James Randi tried to make this crap clear to people for several decades, but I fear that he didn't even make a dent.
 

JRS

Major Contributor
Joined
Sep 22, 2021
Messages
1,158
Likes
1,007
Location
Albuquerque, NM USA
Could we get another thread dedicated to the idea of how Bayesian statistical thinking could be applied to audio testing. One starting from scratch and not involving an already done test. I know Mani's intent was something like this, but it has been polluted by other issues already.

I'd be interested in that if any of you here are well versed in using Bayesian statistics and testing. I sort of understand the idea, but it often seems impractical to use.

Bayesian statistics is a more complete theory of such things, and frequentist methods (what most people know, such as p-values) are a subset of that broader probability theory.

So are there any Bayesian experts on the forum who could help out?
No expert here; what little I do know is that the math is used in algorithms of all sorts where decision making is involved (AI), but I was unaware of its use where belief in an outcome is an experimental variable. So that's interesting.

Where I have seen it used, it is usually more cut and dried. For example: if we did angiography on all adults over 45, how many would show narrowing of the arteries, versus if we look at the subpopulation of those over 45 complaining of moderate intermittent chest pain and do angiography on all of them, how many show narrowing of the arteries?

This allows quantifying the degree of risk of atherosclerosis associated with moderate intermittent chest pain. Knowing that is of great value. Then you may weigh the number of unexpected fatalities from the procedure among otherwise healthy individuals who have chosen to have an elective procedure done against the value of knowing about atherosclerosis earlier and extending their lives. The expected value may turn out to be negative, in which case there would need to be more inclusion variables before that intervention, all of which could be sorted out using Bayesian analysis.
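To make that concrete, here's a tiny Bayes' theorem sketch in Python with purely made-up numbers (the base rate and the two conditional probabilities for chest pain are hypothetical placeholders, not clinical data):

```python
# Hypothetical numbers purely for illustration -- not real clinical data.
base_rate = 0.10            # P(narrowing) among all adults over 45 (assumed)
p_pain_given_narrow = 0.40  # P(chest pain | narrowing) (assumed)
p_pain_given_clear = 0.05   # P(chest pain | no narrowing) (assumed)

# Bayes' theorem: P(narrowing | chest pain)
p_pain = (p_pain_given_narrow * base_rate
          + p_pain_given_clear * (1 - base_rate))
posterior = p_pain_given_narrow * base_rate / p_pain
print(f"P(narrowing | chest pain) = {posterior:.2f}")   # ~0.47 with these numbers
```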
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,758
Likes
37,598
No expert here; what little I do know is that the math is used in algorithms of all sorts where decision making is involved (AI), but I was unaware of its use where belief in an outcome is an experimental variable. So that's interesting.

Where I have seen it used, it is usually more cut and dried. For example: if we did angiography on all adults over 45, how many would show narrowing of the arteries, versus if we look at the subpopulation of those over 45 complaining of moderate intermittent chest pain and do angiography on all of them, how many show narrowing of the arteries?

This allows quantifying the degree of risk of atherosclerosis associated with moderate intermittent chest pain. Knowing that is of great value. Then you may weigh the number of unexpected fatalities from the procedure among otherwise healthy individuals who have chosen to have an elective procedure done against the value of knowing about atherosclerosis earlier and extending their lives. The expected value may turn out to be negative, in which case there would need to be more inclusion variables before that intervention, all of which could be sorted out using Bayesian analysis.
See, I have enough understanding of it that I get why it works in the example you explained. Now, is there a reasonably good use of it for the situation where we try to answer the question, "Do two DACs whose measurements suggest they are audibly transparent nonetheless sound different?"

Many of us here would have a prior in mind saying they do not sound different, while Mani will say his prior, from experience, is that they do sound different. Is Bayesian reasoning any help with this problem, and what is the proper way to approach the question? Actually, I would expect Bayesian reasoning, properly applied, to lead to the correct answer for both groups, whatever the real answer is. The answer would be the same for both groups, but with different priors the steps to get there would be different.
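One way to sanity-check that intuition is a toy Beta-Binomial sketch (all priors and data below are hypothetical; it just shows how a skeptic's prior and a believer's prior get pulled toward each other by the same evidence):

```python
# Two hypothetical priors over the probability p of a correct ABX answer:
# a skeptic centred near 0.5 and a believer centred well above it.
# Both are updated on the same (made-up) data.
def posterior_mean(prior_a, prior_b, hits, trials):
    # Beta prior + binomial likelihood -> Beta(prior_a + hits, prior_b + misses)
    a = prior_a + hits
    b = prior_b + (trials - hits)
    return a / (a + b)

for hits, trials in [(9, 10), (90, 100), (900, 1000)]:    # hypothetical results
    skeptic  = posterior_mean(20, 20, hits, trials)   # strong prior near p = 0.5
    believer = posterior_mean(16, 4,  hits, trials)   # strong prior near p = 0.8
    print(f"{hits}/{trials}: skeptic -> {skeptic:.2f}, believer -> {believer:.2f}")
```

With only 10 trials the two posteriors still disagree noticeably; with a few hundred they largely converge, whatever the underlying truth is.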
 

earlevel

Addicted to Fun and Learning
Joined
Nov 18, 2020
Messages
550
Likes
779
I got 9/10 in an ABX. That's a 99% probability that I was actually hearing something.
I still have a lot of catching up to do on this fast-moving thread, but I have to point out that that's not what statistical significance means.

Clearly, if you got 10 out of 10, it's not 100% certainty that you did not guess. If everyone on this board, each morning, took out a coin and called "heads", then flipped it 10 times, at some point we'd have someone with all 10 tosses "heads". Maybe even on the first day. It doesn't mean they can predict coin flips; they almost certainly can't. But if we polled everyone's results each day, we'd expect the familiar bell-shaped (binomial) distribution.
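A rough back-of-the-envelope illustration of that point (the member counts are hypothetical):

```python
# Chance that at least one person out of N gets 10 heads in 10 flips,
# assuming everyone is just guessing (N values are made up).
p_single = 0.5 ** 10          # one person, one morning: ~1 in 1024
for members in (100, 1000, 10000):
    p_someone = 1 - (1 - p_single) ** members
    print(f"{members} members: P(at least one 10/10) = {p_someone:.1%}")
```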

I find it common to misunderstand what confidence means. A friend reported getting 20 of 30 listening tests correct; he said that was very near the "golden" 95% confidence (he said he didn't want to test any more and ruin his score). In this case, it was a test of hearing 24-bit dithered versus truncated audio. More recently, I gave him a tone test at decreasing levels, which he found difficult to hear beyond -90 dBFS, which further casts doubt on his (anyone's, of course) ability to hear ~-140 dB in the presence of normal music levels.

So, yes, 9 out of 10 is very attention-getting—I agree that "the 9/10 result shouldn't be easily dismissed as a simple fluke". The problem is, it's not conclusive either.

Which brings up a second problem with these kinds of tests. If we're giving such a test to a large group to find out what the general ability of a population is, then it makes sense to give everyone 10 tries. But if we're trying to find out whether you can really pick the right choice, we need more than 10.
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,758
Likes
37,598
I still have a lot of catching up to do on this fast-moving thread, but I have to point out that that's not what statistical significance means.

Clearly, if you got 10 out of 10, it's not 100% certainty that you did not guess. If everyone on this board, each morning, took out a coin and called "heads", then flipped it 10 times, at some point we'd have someone with all 10 tosses "heads". Maybe even on the first day. It doesn't mean they can predict coin flips; they almost certainly can't. But if we polled everyone's results each day, we'd expect the familiar bell-shaped (binomial) distribution.

I find it common to misunderstand what confidence means. A friend reported getting 20 of 30 listening tests correct; he said that was very near the "golden" 95% confidence (he said he didn't want to test any more and ruin his score). In this case, it was a test of hearing 24-bit dithered versus truncated audio. More recently, I gave him a tone test at decreasing levels, which he found difficult to hear beyond -90 dBFS, which further casts doubt on his (anyone's, of course) ability to hear ~-140 dB in the presence of normal music levels.

So, yes, 9 out of 10 is very attention-getting—I agree that "the 9/10 result shouldn't be easily dismissed as a simple fluke". The problem is, it's not conclusive either.

Which brings up a second problem with these kinds of tests. If we're giving such a test to a large group to find out what the general ability of a population is, then it makes sense to give everyone 10 tries. But if we're trying to find out whether you can really pick the right choice, we need more than 10.
This is a good point often overlooked. I've sometimes set up a simple spreadsheet that generates random yes and no answers, then let it run however many times I want. You do indeed get runs of 10 out of 10 right, at just about the rate predicted by simple statistical distributions. I know by design the results are random.

Still, 10 of 10 is not common at random: roughly 1 in 1,000. 20 of 20 is about 1 in a million. So I do consider 10 of 10 very convincing. You have to be fair about it, otherwise you can't make any use of such testing. In such testing, doing more than 10 or 20 trials becomes incredibly tedious. So while you cannot say 10 of 10 is definitive proof of something, it is pretty convincing.
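For anyone without a spreadsheet handy, the same experiment is a few lines of Python; with pure guessing, perfect runs show up at roughly the expected 1-in-1024 rate:

```python
import random

# Simulate many independent 10-trial "guessers" and count perfect runs.
runs, trials = 100_000, 10
perfect = sum(
    1 for _ in range(runs)
    if all(random.random() < 0.5 for _ in range(trials))
)
print(f"{perfect} perfect runs out of {runs:,} "
      f"(expected ~{runs / 2**trials:.0f} if purely random)")
```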

Up/down testing of thresholds is much better and less tedious, but you have to have a variable you can precisely manipulate for that to work. For instance, you can find your threshold for distortion this way. It doesn't work when someone says, "I can hear these two DACs aren't the same" when measurements show both should be below the level of audibility, because you are at a loss as to what has changed between them.
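For what it's worth, here is a rough sketch of what such an up/down (staircase) procedure can look like in code. The "listener" is a made-up model with a -60 dB threshold, and the 2-down/1-up rule converges near the ~70.7%-correct point; none of this comes from Mani's test, it's just to show the mechanics:

```python
import random

def listener_hears(level_db, threshold_db=-60.0, slope=0.5):
    # Hypothetical listener: detection probability rises smoothly above threshold.
    p_detect = 1 / (1 + 10 ** (-(level_db - threshold_db) * slope))
    # In a 2AFC trial a non-detection still yields a 50% lucky guess.
    return random.random() < p_detect + (1 - p_detect) * 0.5

def staircase(start_db=-40.0, step_db=2.0, reversals_wanted=8):
    level, correct_streak, reversals, last_dir = start_db, 0, [], None
    while len(reversals) < reversals_wanted:
        if listener_hears(level):
            correct_streak += 1
            if correct_streak == 2:           # 2-down: two correct in a row -> harder
                correct_streak = 0
                if last_dir == "up":
                    reversals.append(level)
                level, last_dir = level - step_db, "down"
        else:                                 # 1-up: one miss -> easier
            correct_streak = 0
            if last_dir == "down":
                reversals.append(level)
            level, last_dir = level + step_db, "up"
    return sum(reversals) / len(reversals)    # estimate near the ~70.7% point

print(f"Estimated threshold: {staircase():.1f} dB (true value set to -60 dB here)")
```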
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,484
Likes
4,110
Location
Pacific Northwest
... is there a reasonably good use of it for the situation where we try to answer the question, "Do two DACs whose measurements suggest they are audibly transparent nonetheless sound different?"

Many of us here would have a prior in mind saying they do not sound different, while Mani will say his prior, from experience, is that they do sound different. Is Bayesian reasoning any help with this problem, and what is the proper way to approach the question? ...
@JRS 's example illustrates that the Bayesian prior must be supported by evidence or observation, which makes it quantifiable. If different people don't agree on the prior, that suggests the evidence or observation is too weak or inapplicable to be a valid prior.

How do we apply this in your example of the DACs? Two ways come to mind, both related to how software companies apply Bayesian methods when AB testing changes to web sites & applications.

1. From listeners having performed past ABX tests, incorporate those past results into a probability distribution reflecting their acuity or skill.

2. From all the DACs you have previously ABX tested, incorporate the DAC measurements (or differences between them) into a probability distribution reflecting their expected likelihood to be differentiated.
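A minimal sketch of how option 1 might be encoded, with hypothetical counts (a listener's past ABX record folded into a Beta prior, then updated with a new result):

```python
# All counts here are hypothetical, purely to show the mechanics.
past_hits, past_misses = 12, 8        # hypothetical prior ABX history
new_hits, new_trials = 9, 10          # hypothetical new test result

# Beta(1 + past_hits, 1 + past_misses) prior + binomial likelihood -> Beta posterior
a = 1 + past_hits + new_hits
b = 1 + past_misses + (new_trials - new_hits)

# P(p > 0.5 | data) via a crude numeric grid (no stats library needed)
grid = [i / 10000 for i in range(1, 10000)]
density = [p ** (a - 1) * (1 - p) ** (b - 1) for p in grid]
above_chance = sum(d for p, d in zip(grid, density) if p > 0.5) / sum(density)
print(f"Posterior P(listener does better than chance) = {above_chance:.3f}")
```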
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,758
Likes
37,598
@JRS 's example illustrates that the Bayesian prior must be supported by evidence or observation, which makes it quantifiable. If different people don't agree on the prior, that suggests the evidence or observation is too weak or inapplicable to be a valid prior.

How do we apply this in your example of the DACs? Two ways come to mind, both related to how software companies apply Bayesian methods when AB testing changes to web sites & applications.

1. From listeners having performed past ABX tests, incorporate those past results into a probability distribution reflecting their acuity or skill.

2. From all the DACs you have previously ABX tested, incorporate the DAC measurements (or differences between them) into a probability distribution reflecting their expected likelihood to be differentiated.
That gets to the question of what a proper prior is. I understand it in medical situations, where you have plenty of prior data to work with, or if we had lots of ABX testing of DACs. We don't in the case of DACs. You have proxies like testing of distortion, FR, and other factors.
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,484
Likes
4,110
Location
Pacific Northwest
I still have a lot of catching up on this fast moving thread, but I have to point out that not what statistical significance means.

Clearly, if you got 10 out of 10, it's not 100% certainty that you did not guess.
Even a blind squirrel sometimes finds a nut, but he does so less often.

The binomial distribution says you'll get at least 9 of 10 boolean questions correct by random guessing (or coin flipping) only 1.07% of the time. Which is 100 - 1.07 = 98.93% confidence that you weren't guessing. 10 of 10 is 99.9%.

Even short tests can be significant: 5 of 5 is 96.9% confident, which beats the standard 5% chance to be guessing.
If you get one wrong, you need to go to 8 trials: 7 of 8 is 96.5% confident.
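Those tail probabilities are easy to verify with the exact binomial, e.g.:

```python
from math import comb

def tail(k, n, p=0.5):
    # P(at least k correct out of n trials when guessing with probability p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

for k, n in [(9, 10), (10, 10), (5, 5), (7, 8)]:
    print(f"{k}/{n}: guessing probability {tail(k, n):.2%}, "
          f"confidence {1 - tail(k, n):.2%}")
```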

I find it common to misunderstand what confidence means. A friend reported getting 20 of 30 listening tests correct; he said that was very near the "golden" 95% confidence (he said he didn't want to test any more and ruin his score).
He was correct, actually slightly better. 20 of 30 is only 4.94% likely by guessing, or 95.06% confidence.

Confidence levels are a double-edged sword. Higher levels increase precision at the expense of recall; they reduce false positives but increase false negatives. For example, with a 99% threshold the guy who gets 9 of 10 correct fails (just barely), but it's far more likely that he was hearing a real difference than that he was guessing. So we typically use 95% confidence as a simple compromise. You can go higher and mitigate the risk of false negatives by using additional trials, but this requires multiple separate tests to avoid listener fatigue impairing the results.
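To put a number on the false-negative side of that trade-off (the listener's "true" 80% ability below is a made-up figure):

```python
from math import comb

def tail(k, n, p):
    # P(at least k correct out of n) for per-trial success probability p
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, p_true = 10, 0.8                       # hypothetical real ability: 80% per trial
for label, alpha in [("95%", 0.05), ("99%", 0.01)]:
    # smallest score that clears the threshold in a 10-trial test
    need = next(k for k in range(n + 1) if tail(k, n, 0.5) <= alpha)
    power = tail(need, n, p_true)         # chance such a listener actually passes
    print(f"{label} criterion: need {need}/{n}; "
          f"a true {p_true:.0%} listener passes {power:.0%} of the time")
```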
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,758
Likes
37,598
Even a blind squirrel sometimes finds a nut, but he does so less often.

The binomial distribution says you'll get at least 9 of 10 boolean questions correct by random guessing (or coin flipping) only 1.07% of the time. Which is 100 - 1.07 = 98.93% confidence that you weren't guessing. 10 of 10 is 99.9%.

Even short tests can be significant: 5 of 5 is 96.9% confident, which beats the standard 5% chance to be guessing.
If you get one wrong, you need to go to 8 trials: 7 of 8 is 96.5% confident.


He was correct, actually slightly better. 20 of 30 is only 4.94% likely by guessing, or 95.06% confidence.

Confidence levels are a double-edged sword. Higher levels increase precision at the expense of recall; they reduce false positives but increase false negatives. For example, with a 99% threshold the guy who gets 9 of 10 correct fails (just barely), but it's far more likely that he was hearing a real difference than that he was guessing. So we typically use 95% confidence as a simple compromise. You can go higher and mitigate the risk of false negatives by using additional trials, but this requires multiple separate tests to avoid listener fatigue impairing the results.
I think we should require 3-sigma results myself; 5% actually isn't good enough. Manufacturing found that when it used 5% significance as the criterion, versus nothing, overall quality actually went down. Going with 3 sigma as the cutoff allowed better control of manufactured products, and quality improved so much that it wasn't too hard to get to 5 sigma if you wanted to.
 

JRS

Major Contributor
Joined
Sep 22, 2021
Messages
1,158
Likes
1,007
Location
Albuquerque, NM USA
See I have enough understanding of it that I get why it works in the example you explained. Now is there a reasonably good use of it for the situation where we try and answer the question, "Do two DACs with measurements we expect are audibly transparent nonetheless sound different?"

Many of us here would have a prior in mind saying they do not sound different, while Mani will say, his prior, from experience is they do sound different. Is bayesian reasoning of the problem any help and what is the proper way to approach this question? Actually I would expect bayesian reasoning properly applied would lead to the correct answer for both groups whatever the real answer is. And the answer would be the same for both groups, but with different priors the steps to get to the answer would be different.
From what I can tell (which ain't much), it is a philosophical issue as much as a mathematical one. To wit:
There are many reasons for adopting Bayesian methods, and their applications appear in diverse fields. Many people advocate the Bayesian approach because of its philosophical consistency. Various fundamental theorems show that if a person wants to make consistent and sound decisions in the face of uncertainty, then the only way to do so is to use Bayesian methods. Others point to logical problems with frequentist methods that do not arise in the Bayesian framework. On the other hand, prior probabilities are intrinsically subjective – your prior information is different from mine – and many statisticians see this as a fundamental drawback to Bayesian statistics. Advocates of the Bayesian approach argue that this is inescapable, and that frequentist methods also entail subjective choices, but this has been a basic source of contention between the 'fundamentalist' supporters of the two statistical paradigms for at least the last 50 years. In contrast, it is more the pragmatic advantages of the Bayesian approach that have fuelled its strong growth over the last 20 years, and are the reason for its adoption in a rapidly growing variety of fields. Source: https://bayesian.org/what-is-bayesian-analysis
Personally, I prefer my mathematics be constrained to the use of symbols, having never given much thought about the philosophy of math.

Just for fun, I turned the null hypothesis around and assumed that there was indeed a difference, one that Mani could peg 99% of the time. In a 10-trial run, the probability of getting 10/10 was about 0.90 and 9/10 about 0.09. Less than nine was down in the dust at about 0.4%.
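Those reversed-null numbers are easy to reproduce:

```python
from math import comb

n, p = 10, 0.99     # assume, as above, a listener who really is right 99% of the time
prob = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
print(f"P(10/10)      = {prob[10]:.3f}")                        # ~0.904
print(f"P(9/10)       = {prob[9]:.3f}")                         # ~0.091
print(f"P(8 or fewer) = {sum(prob[k] for k in range(9)):.4f}")  # ~0.004, i.e. ~0.4%
```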

I still don't know what to make of the "pre-trials", as in maybe it's akin to comparing two test tones that differ by less than 1% in frequency; having an immediate refresher might be critical to success. But I am still having problems with the randomness of the results when no A and B were given just before X. As mentioned above, it seems Mani should have gotten them all right or all wrong, or at least mostly right or mostly wrong.

I do hope that he tries this again with other equipment. I'd only suggest that the assistant be in the same room, and that Mani could familiarize himself at his own pace if need be before the actual "testing" begins.
 

earlevel

Addicted to Fun and Learning
Joined
Nov 18, 2020
Messages
550
Likes
779
The binomial distribution says you'll get at least 9 of 10 boolean questions correct by random guessing (or coin flipping) only 1.07% of the time. Which is 100 - 1.07 = 98.93% confidence that you weren't guessing. 10 of 10 is 99.9%.
Well, a part of my point was that 95% confidence doesn't actually mean confidence that you weren't guessing. I know people say it that way a lot, but that's not what it means. 20 of 30 is ~95% confidence. It doesn't mean it's 95% certain that my friend could hear a difference in the two signals. In fact, he admitted he couldn't hear the naked signals below -90 dBFS at his listening station (he's a mastering engineer). So, common sense casts heavy doubt on his 20 of 30 test, which should be far more difficult (listening to the bottom bits of 24-bit, in normal-volume music).

Confidence level is the statistical probability that repeated sampling over a population would fall within the confidence interval. What's the confidence interval here? This is applying a calculation that is irrelevant. (BTW, note that if the 9 of 10 had instead been 8 of 10, it would be below 95%, and people would doubt the significance of the result because it was below the magic number.)

Odds are a different thing entirely. 9 out of 10 for something random is ~1.1%. It can happen, it just won't happen often. But if you do a single trial of 10 attempts and get 9, it doesn't mean there is a 99% chance it was no fluke. It does mean that if he were guessing, he won't often be able to guess that well again. And the thing about small successful tests is that people who try and fail right away are unlikely to post their results, so we tend to hear the success stories.

Like I said, though, 9 out of 10 is definitely enough to get my attention, and to make me want to see another round. I'm not rooting for him to fail; I'd like to see that he can repeat the performance. A lot of audio people refuse to test themselves, and I think it's great if he can hear it. Being skeptical of everything, I'd also want to think about other factors that might have contributed to the sound difference other than solely the thing being tested, but I wouldn't dismiss success as being impossible, and I don't even know what the variables were. Plus, he says he's confident he can measure the difference, and if so that would tell a lot.
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,484
Likes
4,110
Location
Pacific Northwest
I think we should require 3-sigma results myself; 5% actually isn't good enough. Manufacturing found that when it used 5% significance as the criterion, versus nothing, overall quality actually went down. Going with 3 sigma as the cutoff allowed better control of manufactured products, and quality improved so much that it wasn't too hard to get to 5 sigma if you wanted to.
In my view, this depends on the number of trials. Approximating 3 sigma as 99.7%, it requires a minimum of 9 straight, or 12 of 13. In a short test like this, are we willing to say that someone who gets 8 of 9, or 11 of 13, can't hear a difference? The odds are with him. You're going to get false negatives. However, with more trials, you can achieve 99.7 with fewer false negatives. Like 23 of 30. If you do enough trials, even getting only 51% of them right will hit 99.7%. This virtually eliminates the chance of false negatives. But then you need to do multiple test sessions to prevent listener fatigue from tainting the results. Which introduces the challenge of consistency: keeping multiple tests done in different sessions consistent enough to aggregate the results.
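Here's a rough sketch of how the required hit rate shrinks with trial count, treating 3 sigma as a one-sided 0.3% tail the way the post does (a simplification):

```python
from math import lgamma, exp, log

def min_score_for(n, alpha=0.003, p=0.5):
    # Smallest k such that P(at least k of n | guessing) <= alpha.
    log_pmf = [lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
               + i * log(p) + (n - i) * log(1 - p) for i in range(n + 1)]
    tail = 0.0
    for k in range(n, -1, -1):          # accumulate the upper tail downward
        tail += exp(log_pmf[k])
        if tail > alpha:
            return k + 1
    return 0

# Smallest score clearing roughly 3 sigma (99.7%) at various test lengths
for n in (9, 10, 13, 30, 100, 1000, 10000):
    k = min_score_for(n)
    print(f"n = {n:>5}: need {k}/{n} correct ({k / n:.1%})")
```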

In short, really high goals like 3 sigma are great, if you can afford them, and devise a way to make them practical. For those of us testing ourselves as a hobby for fun and education, I think 95% is enough to get an objective sense for whether what we hear is real. It's not like lives or nations depend on the results.
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,758
Likes
37,598
In my view, this depends on the number of trials. Approximating 3 sigma as 99.7%, it requires a minimum of 9 straight, or 12 of 13. In a short test like this, are we willing to say that someone who gets 8 of 9, or 11 of 13, can't hear a difference? The odds are with him. You're going to get false negatives. However, with more trials, you can achieve 99.7 with fewer false negatives. Like 23 of 30. If you do enough trials, even getting only 51% of them right will hit 99.7%. This virtually eliminates the chance of false negatives. But then you need to do multiple test sessions to prevent listener fatigue from tainting the results. Which introduces the challenge of consistency: keeping multiple tests done in different sessions consistent enough to aggregate the results.

In short, really high goals like 3 sigma are great, if you can afford them, and devise a way to make them practical. For those of us testing ourselves as a hobby for fun and education, I think 95% is enough to get an objective sense for whether what we hear is real. It's not like lives or nations depend on the results.
That brings up another issue. If you have to do so many trials that 51% is a p value of 5%, the difference, while perhaps real and discernible in some sense, is also extremely small. If you get 51% instead of 50%, then most of the time you failed to hear a difference, so just how important is that difference once you have reached that level?

One example was the testing with MQA. They had lots of aggregated trials with teams of trained listeners who had additional training in hearing just the difference MQA made. There were additional nits to pick with their procedure, but I think scores were around 56 to 59%, when anything less than 56% gave a p value above 5%. Oh, and this was in unusually quiet listening rooms with very high quality SOTA gear throughout the chain. To use that as proof that MQA is different, and then to proclaim it a large and important difference for the general public, who listen on much lesser gear, not in a controlled environment, and without listener training, was farcical. Anything so difficult to tease out is by definition an extremely minor difference even once heard (if heard).
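The thread doesn't give the actual MQA trial count, but a quick sketch shows the order of magnitude implied by a 56% score sitting right at the 5% boundary:

```python
from math import comb, ceil

def p_value(k, n):
    # Exact binomial tail: P(at least k of n correct by guessing)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Roughly how many aggregated trials does a 56% hit rate need before it
# clears p < 0.05? (The real trial count isn't given in the thread; this is
# just the order of magnitude implied by that kind of score.)
n = 10
while p_value(ceil(0.56 * n), n) > 0.05:
    n += 10
print(f"~{n} trials: 56% correct ({ceil(0.56 * n)}/{n}) "
      f"gives p = {p_value(ceil(0.56 * n), n):.3f}")
```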
 