
Could someone help me to think through my ABX result using Bayesian reasoning?

MarkS

Major Contributor
Joined
Apr 3, 2021
Messages
1,070
Likes
1,510
Thanks Mark.

Before doing the test, I was confident that I would be able to demonstrate the differences that I was certain I was hearing. My confidence level was high. But I'd never been involved in a listening test before. So let's say my prior was 0.9.

I achieved 0.99 in the ABX.

So what's my posterior, in light of the new data?

Mani.

This is too vague to be a prior.

In my old AES paper, I assumed that, for each guess, the listener hears a difference in a fraction h of the tests (0<=h<=1) and responds correctly. In a fraction 1-h of the tests, the listener does not hear a difference (even if he thinks he does), and guesses. The probability of a right answer (for a single trial) is then h + (1/2)(1-h) = (1/2)(1+h), and the probability of a wrong answer is (1/2)(1-h). In a sequence of trials, the probability of getting R right answers and W wrong answers is N (1+h)^R (1-h)^W, where N is a constant that depends on R and W but not h (and whose value will not matter). This is called the likelihood function.
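As a numerical sketch of that single-trial model (the helper name `likelihood` is mine, not from the paper):

```python
from math import comb

def likelihood(h, R, W):
    """P(R right, W wrong | h) under the model above:
    per trial, p(correct) = (1 + h) / 2 and p(wrong) = (1 - h) / 2."""
    return comb(R + W, R) * ((1 + h) / 2) ** R * ((1 - h) / 2) ** W

# Sanity check: at h = 0 every answer is a pure guess, so 9-of-10
# reduces to the plain binomial(10, 1/2) probability.
print(likelihood(0.0, 9, 1))   # → 0.009765625  (= 10 / 1024)
print(likelihood(0.8, 9, 1))   # a listener who hears it 80% of the time
```

The binomial coefficient here is the constant N that drops out of the Bayesian analysis, since it does not depend on h.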

Now we need a prior. Before doing the experiment, what do we think about h? Skeptics of the audibility of differences would want to say that it is very likely that h is at or very near zero. Someone with no skepticism (or knowledge of prior audio test results) could apply Laplace's Principle of Indifference, and say that any value of h between zero and one is equally likely.
So if we call the prior Prior(h), the "indifference" prior is Prior(h)=1.

To get the posterior we multiply the likelihood by the prior,

Post(h) = N' (1+h)^R (1-h)^W Prior(h)

where N' is a different constant, fixed by the requirement that the integral of Post(h) over h from 0 to 1 is 1. (In words, the posterior probabilities sum to 1.)

For R=9 and W=1, and the "indifference" prior Prior(h)=1, here is a graph of Post(h):

[Attached image p.jpg: graph of Post(h) for R=9, W=1]


So if we had no idea of what h should be going in, we now think it's most likely to be near 0.8.

Note that, because there was one wrong answer, we are now certain that h=1 is not possible, and so Post(h) drops to zero as h approaches 1.
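Setting the derivative of log Post(h) to zero, 9/(1+h) = 1/(1-h), gives h = 0.8 exactly. A quick grid search (a sketch, not from the paper) agrees:

```python
# Unnormalized posterior for R = 9, W = 1 with the flat prior Prior(h) = 1.
def post(h):
    return (1 + h) ** 9 * (1 - h)

# Grid search over h in [0, 1]; the calculus answer is h = 0.8.
grid = [i / 10000 for i in range(10001)]
mode = max(grid, key=post)
print(mode)  # → 0.8
```

Note also that post(1.0) evaluates to zero, matching the observation that one wrong answer rules out h = 1.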

What happens if we think, going in, hearing a difference is very unlikely?

I won't go through the math (and this case was not treated in the paper), but suppose you assign a prior probability of 1-d to h=0 exactly, and probability d to h being equally likely to be anywhere between 0 and 1. If d is very small, it means that you're pretty darn sure that no difference can be heard, but you'll allow a little wiggle room. Then, after the listener gets 9 right and 1 wrong, the posterior value of d goes up to 18.5 times the original value. So if d=0.01 going in (that is, you allow a 1% probability that a difference can be heard before the test), then afterward, that probability rises to 0.185. That's a noticeable effect. But if your starting value of d is much smaller, say 0.0001, then after it will be 0.00185: still really small. Then you'd need more convincing.
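The 18.5 can be reproduced numerically: it is the Bayes factor, i.e. the integral of (1+h)^9 (1-h) over [0,1], since the common factor (1/2)^10 cancels between the two hypotheses. (Strictly, the factor 18.5 multiplies the odds; for d = 0.01 the exact posterior probability works out to about 0.158, close to the 0.185 shortcut, and the difference vanishes for small d.) A sketch:

```python
# Bayes factor for "a difference is audible" (h uniform on [0, 1])
# versus "pure guessing" (h = 0), after 9 right and 1 wrong.
# The common (1/2)**10 cancels, leaving the integral of (1+h)**9 (1-h),
# evaluated here by the midpoint rule.
n = 100_000
bf = sum((1 + (i + 0.5) / n) ** 9 * (1 - (i + 0.5) / n) for i in range(n)) / n
print(round(bf, 1))  # → 18.5

# Posterior probability of audibility, from a prior mass d on "audible":
d = 0.01
posterior = bf * d / (bf * d + (1 - d))
print(round(posterior, 3))  # → 0.158
```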

Anecdote: I wrote that AES paper shortly after discovering Bayesian analysis and being thoroughly besotted with it as just the bee's knees. I was explaining it, rather excitedly, to an experimental physicist. He yawned and said, "If how you do the statistical analysis matters, then you need to get more data."

And that, ladies and gentlemen, is correct! 10 tests just isn't very many. But get 90 right out of 100, and the factor increase in d goes from 18.5 to 10^15. Now THAT would be data that could only be ignored by the most confirmed skeptic going in.
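The 10^15 figure checks out numerically: for 90 right and 10 wrong, the same construction gives a Bayes factor of roughly 1.45 x 10^15 (a sketch, same midpoint-rule approach as before):

```python
# Bayes factor for 90 right out of 100: the integral of (1+h)**90 (1-h)**10,
# again by the midpoint rule (the common (1/2)**100 cancels as before).
n = 200_000
bf = sum((1 + (i + 0.5) / n) ** 90 * (1 - (i + 0.5) / n) ** 10
         for i in range(n)) / n
print(f"{bf:.2e}")  # → 1.45e+15
```

The posterior mode is again at h = 0.8 (90(1-h) = 10(1+h)), but the evidence against pure guessing is now overwhelming rather than merely suggestive.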
 
Last edited:

solderdude

Grand Contributor
Joined
Jul 21, 2018
Messages
15,999
Likes
36,215
Location
The Neitherlands
There was more data.
There were 30 attempts but the first 20 were ignored by Mani.

The funny thing about this story is that the first 20 attempts were valid, but it seems that data was never properly analyzed.
Mani stated that he could hear what he thought/claims to hear, for a long time and reliably.
Remember, this is at his home using familiar equipment and with music chosen by him.

The reasoning used by Mani that only the last 10 attempts are valid (and he would like to ignore the one he got wrong) baffles me, but I understand his reasoning.

The reason Mani claims the first 20 attempts should be dismissed is that the method is wrong. That it was just warming up. That there was stress in the first 20 attempts but the stress went away after he found out he could not ace the test yet claimed he heard differences.
This is nonsensical. I believe he heard differences. Of course he did, that seems to have been the case for years. It's just that the differences he heard had no relation to reality. This too happens to everyone so it proves Mani is no super human but just human.

He heard this for many years and knows what he was listening to on his own gear. So if Mani could really tell differences and they were really there, inexplicably, for some technical reason(s), then in the first 20 attempts there should at least be a pattern showing that when X went from A to B, that should be heard every time.
Mani claims there was no reference (like in the 3rd attempt) but there was. Each X before the next X was a reference when listening for a difference.

I am baffled that the actual input (A or B) and the result (Mani's answer A or B) weren't logged or made public. That could have brought so much more insight to the table.

I am not a statistics guy but 30 attempts seems more valid than 10. Both are blind tests, both are valid and have a reference, the only claimed difference was Mani's mindset.

That would be weird as
A: Mani claims fatigue set in (the one he missed) but that excuse wasn't used in the first 20 attempts.
B: Mani claims the first 20 attempts had no reference, yet they did, but did not disclose the data, only that he was guessing (yet claims he heard differences yet guessed).
C: Mani claims he wants to know but didn't opt for having 'the expert' over one more time and make 20 more ABX attempts at a later date.

Does Mani want to prove something to himself or to the world ?
In case of the first he should believe the 3rd run and only the first 8 attempts were good (because fatigue set in, the test stopped being valid)
In case of the 2nd option (proving the world) he should do another blind, witnessed attempt with at least 20 attempts and proper reporting.
He should be allowed to test run a few times till he is confident the test should be started.

So Mani... if you want to be taken seriously by others.... do it again.
If you want to believe... claim you did a real blind test. Made 8 attempts and totally killed it.
 

charleski

Major Contributor
Joined
Dec 15, 2019
Messages
1,098
Likes
2,240
Location
Manchester UK
But here we have a different situation (a very, very, very common situation in audio hobby land) where someone claims they already hear a difference ROUTINELY.
I routinely hear a difference in my system all the time. Some days it sounds really great and I congratulate myself on having reached end-game excellence. Some days it sounds OK. Some days I find myself looking through reviews and wondering about an upgrade.

This all happens without making any change to the system at all. I’ve seen an avowed subjectivist write about how he experiences the same phenomenon.

“Hearing differences” happens. Personally I wonder whether medium-term changes in the tone of the tensor tympani muscle are altering the response profile of the middle ear, but research on this is very sparse.
 

AudioStudies

Addicted to Fun and Learning
Joined
May 3, 2020
Messages
718
Likes
400
MarkS said:
Anecdote: I wrote that AES paper shortly after discovering Bayesian analysis [...] 10 tests just isn't very many. But get 90 right out of 100, and the factor increase in d goes from 18.5 to 10^15.
Thank you for hitting the nail right on the head.
 
OP
manisandher

manisandher

Addicted to Fun and Learning
Joined
Nov 6, 2016
Messages
656
Likes
612
Location
Royal Leamington Spa, UK
@solderdude , I appreciate you sharing your thoughts, but I take issue with the majority of your post.

Remember, this is at his home using familiar equipment and with music chosen by him.

Yes. I would never have agreed to the test otherwise.

The reason Mani claims the first 20 attempts should be dismissed is that the method is wrong. That it was just warming up.

No, the 2 non-ABX tests were not planned as warm-ups at all. They were planned as real listening tests. They just happened to be non-ABX, with all the difficulties that this introduced.

What I meant was that in retrospect, they acted as good warm-ups for the ABX, and that they strongly suggested that there were no 'tells'.

He heard this for many years, knows what he was listening to on his own gear so if Mani could tell really tell differences and they were really there ,inexplicably for some technical reason(s), then in the first 20 attempts there should at least be a pattern that showed that when X went from A to B that should be heard every time.

I wasn't listening for a switch from A to B (or vice versa) in the first 2 non-ABX tests. I was listening to an X in isolation and asking myself, does that sound like an A or a B? This meant I had to rely on my memory of A and B played at the beginning of each 10-sample non-ABX test. I found this impossible to do, and so guessed.
I am baffled that the actual input (A or B) and the result (Mani's answer A or B) weren't logged or made public. That could have brought so much more insight to the table.

Well, they're logged now, and I can't see any insights:

[Attached image: Listening Test - Cumulative results.jpg]

Can you?
I am not a statistics guy but 30 attempts seems more valid than 10. Both are blind tests, both are valid and have a reference, the only claimed difference was Mani's mindset.

20/30, p=0.05. If you feel that this is the best that we can take out of this whole endeavour, that's fine.
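(For reference, that p is the one-sided binomial tail for 20-of-30 under pure guessing; a quick check, my sketch rather than anything posted in the thread:)

```python
from math import comb

# P(20 or more correct out of 30) if every answer is a coin flip.
p = sum(comb(30, k) for k in range(20, 31)) / 2 ** 30
print(round(p, 3))  # → 0.049
```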

Does Mani want to prove something to himself or to the world ?

Well... someone's already mentioned "you can take a horse to water...". This works both ways. It's pretty clear from this thread that most people here aren't very curious about, firstly, the possibility of bit-identical playback being audibly different, and secondly, the results of the listening test, especially the ABX. That's a real shame, I think, because anyone interested in science should have a curious mind. IMHO.

So Mani... if you want to be taken seriously by others.... do it again.

The only scenario that would make me do more listening tests would be if I managed to measure the differences. I know my curiosity would force me to want to correlate the measured differences with what I could hear.

Mani.
 
OP
manisandher

manisandher

Addicted to Fun and Learning
Joined
Nov 6, 2016
Messages
656
Likes
612
Location
Royal Leamington Spa, UK
@MarkS , thanks so much for sharing.

So am I correct in thinking that d is pretty much just the opposite of h? I.e.,

h = fraction where listener hears a difference and responds correctly
d = fraction where listener hears a difference and responds incorrectly

So, if we assume h=0, then:
1-h = all correct responses are guesses
1-d = all incorrect responses are guesses

So basically, 50/50.

(Apologies if I'm being a bit thick.)

Mani.
 
Last edited:

solderdude

Grand Contributor
Joined
Jul 21, 2018
Messages
15,999
Likes
36,215
Location
The Neitherlands
Thanks for your reply.

It's good to see the actual input and output so we know how often the input changed. It seems to confirm you were guessing even though you claimed to hear differences. It would have been fun (this is a test I have done on a few occasions) to make NO changes at all and see if that was detected or not. Alas... we will never know.
The reverse responses are also random in length, suggesting that what you thought you heard wasn't what was present about half of the time.

This leaves but one option IF you are out to prove the 3rd attempt wasn't a fluke, and that is to repeat the 3rd test but with at least 20 trials.

Only then can you really prove you have some special ability, or that the effect really exists and you beat the odds, and make engineers reconsider what they know to be relevant.

You can forget about capturing these differences because the ADC you will be using also has similar timing errors to begin with.

This leaves but one way to prove things to engineers.
Redo the test with more samples, and perhaps even some longer AAAAAA or BBBBB runs in the test, as your brain expects changes to occur.
You can use the ABXABX test.

All the rest of the talk will amount to the same hill of beans with no changes in any of the peoples minds.

As MarkS says... you need more data. This means more ABX testing, properly conducted. You could do this over several days and quit when you think you have had enough. It needs to be verified independently though. So a lot of train tickets will have to be paid for...

You seem to want a trick or method that proves the third try of only 10 samples is enough to convince engineer-type people. It won't. If you want to prove it, you need to repeat the test with enough statistical evidence. Measuring/capturing a significantly different waveform is not going to happen. If that were possible it would have been done already.
 

BDWoody

Chief Cat Herder
Moderator
Forum Donor
Joined
Jan 9, 2019
Messages
7,039
Likes
23,178
Location
Mid-Atlantic, USA. (Maryland)
It's pretty clear from this thread that most people here aren't very curious about, firstly, the possibility of bit-identical playback being audibly different, and secondly, the results of the listening test, especially the ABX. That's a real shame, I think, because anyone interested in science should have a curious mind. IMHO.

I think that's a bit disingenuous...

You've had a pile of real experts here trying to help you make sense of this. You have a lot of us following along with interest.

What you also have is healthy, warranted skepticism. I, for one, would LOVE to see something like this actually be demonstrated, as it would make things suddenly more interesting. Given how unlikely your claim seems, you should expect a faulty experiment or methodology is going to be a deal killer when it comes to convincing anyone of anything.

You seem to be determined to cram your results into a 'good enough' box, but that's not going to work either.

For me, if there was a money bet, I'd bet a lot that if this test is properly done, you'd trend towards random, not the opposite.

Happy to be shown through proper testing I am wrong, but I'd bet I'm not.
 
OP
manisandher

manisandher

Addicted to Fun and Learning
Joined
Nov 6, 2016
Messages
656
Likes
612
Location
Royal Leamington Spa, UK
You've had a pile of real experts here trying to help you make sense of this.

Yes, I very much appreciate that.

What you also have is healthy, warranted skepticism.

Apart from your spelling (;)), perfectly valid comment.

Given how unlikely your claim seems, you should expect a faulty experiment or methodology is going to be a deal killer when it comes to convincing anyone of anything.

What was I really hoping to achieve in this thread? Convincing people? I don't think so.

The test was conducted well over 3 years ago now. Dismissed as a fluke immediately afterwards. I've been a member here for much longer than that, and yet I've never mentioned the test on this forum before, until this thread.

There's been something bugging me about the dismissal, and not taking into account the context in which the test took place - the priors.

But according to all the experts here, my priors weren't strong enough. OK, I accept that.

Going forward though... Were I to repeat such a test, I'm assuming no-one here could criticise me for assuming a high prior (0.9 or so), if you accept the 20/30 result. I'll take that.

Mani.
 

abdo123

Master Contributor
Forum Donor
Joined
Nov 15, 2020
Messages
7,444
Likes
7,954
Location
Brussels, Belgium
solderdude said:
There was more data. There were 30 attempts but the first 20 were ignored by Mani. [...] So Mani... if you want to be taken seriously by others.... do it again.
You can't in good conscience say that the first two tests are valid; there is no way in hell anyone can 'pass' that format unless there is a mind-blowing tonal change going on, such that you only need to hear the difference once and it's stuck in your head.

If that's gonna be our definition of 'makes a difference' then the validity of the entire frame of thought this forum is based on is incredibly flawed.
 

BDWoody

Chief Cat Herder
Moderator
Forum Donor
Joined
Jan 9, 2019
Messages
7,039
Likes
23,178
Location
Mid-Atlantic, USA. (Maryland)
Apart from your spelling (;)), perfectly valid comment.

Awesome...grammar police hat too.

I'm still too dumb too see what I misspelled. I guess that makes all the difference.:rolleyes:
 

PierreV

Major Contributor
Forum Donor
Joined
Nov 6, 2018
Messages
1,448
Likes
4,812
Going forward though... Were I to repeat such a test, I'm assuming no-one here could criticise me for assuming a high prior (0.9 or so), if you accept the 20/30 result. I'll take that.

The prior either has to be extremely low (as in Mark's example), because you are trying to prove something extraordinary given the hard data collected in a myriad of previous blind/ABX tests, or it could be 0.5 (with a slightly different but fundamentally identical treatment, yielding the same result), in which case your goal would be to prove you are doing better than guessing.

I suggest you watch the MIT open courseware video linked by xavier in this thread, particularly around the time stamp I mentioned to understand why everyone would be correct to criticize your prior. Your prior selection is just making sure the experiment maximizes your own subjective bias from the start.

In any case, numerically speaking, the result would be weaker than the pure frequentist approach. In reality, since the observed data did not change, its true meaning would be identical.

It should be noted that, in both cases, the data is not sufficient to draw _any_ conclusion. In fact (and that may depend a bit on the field in which you learned statistics), if I had had your data as an exam question, my teacher would have shot me down for applying either of those methods to the data presented here without lots of caveats or special treatment. Yes, you can apply and will see Bayesian approaches applied to small data sets but that is only acceptable when either the physical world gives you no choice (random example, repeating GRBs until very recently) or your experimental budget is limited (in which case you qualify your result with a near standard "may suggest that blah blah blah" that smoothly segues into a "could deserve further investigations" which is equivalent to "more money please")

Finally, as far as the protocol is concerned, train as much as you want, but be clear about when the training stops. When the test begins, specify in advance when it ends, in terms of the number of trials. If the test can't be completed (say, "listener fatigue"), report the test as "dropped out" or treat the non-completed trials as failures.
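The "specify in advance" point can be made concrete: for a pre-registered ABX of a fixed length, the passing score that keeps the guessing probability below 5% can be computed before the first trial (a sketch; the 20-trial count is just an example, not a prescription from this thread):

```python
from math import comb

def tail_p(n, k):
    """One-sided p-value: probability of k or more correct out of n by guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Decide the pass mark before the test starts: smallest score with p < 0.05.
n = 20
passmark = min(k for k in range(n + 1) if tail_p(n, k) < 0.05)
print(passmark, round(tail_p(n, passmark), 3))  # → 15 0.021
```

Fixing both n and the pass mark up front is what prevents the "stop when the run looks good" problem discussed throughout this thread.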
 

solderdude

Grand Contributor
Joined
Jul 21, 2018
Messages
15,999
Likes
36,215
Location
The Neitherlands
You can't in good conscience say that the first two tests are valid; there is no way in hell anyone can 'pass' that format unless there is a mind-blowing tonal change going on, such that you only need to hear the difference once and it's stuck in your head.

If that's gonna be our definition of 'makes a difference' then the validity of the entire frame of thought this forum is based on is incredibly flawed.

You either hear a difference in 'attack' or you don't.
When you just get X and X and hear a difference then there must be a transition from A to B or B to A, which is why I asked for the input.

With ABX you either hear a difference (when X = A) or you don't (X=B)

2 different methods to detect A and B differences (not preferences)
 

abdo123

Master Contributor
Forum Donor
Joined
Nov 15, 2020
Messages
7,444
Likes
7,954
Location
Brussels, Belgium
You either hear a difference in 'attack' or you don't.
When you just get X and X and hear a difference then there must be a transition from A to B or B to A, which is why I asked for the input.

With ABX you either hear a difference (when X = A) or you don't (X=B)

2 different methods to detect A and B differences (not preferences)
After 15 seconds you will forget what A is and what B is, and the whole test becomes useless.

OP had only access to X, never to A or B once the test began, which in my opinion, is pointless.
 
OP
manisandher

manisandher

Addicted to Fun and Learning
Joined
Nov 6, 2016
Messages
656
Likes
612
Location
Royal Leamington Spa, UK
With ABX you either hear a difference (when X = A) or you don't (X=B)

I certainly didn't consciously approach any of the tests like that. As I said, I'd never done any listening tests beforehand, so had no preconceived ideas of how to approach them.

For me, it was simply: does X sound more like the A I just heard, or the B I just heard? That's all.
 
OP
manisandher

manisandher

Addicted to Fun and Learning
Joined
Nov 6, 2016
Messages
656
Likes
612
Location
Royal Leamington Spa, UK
OP had only access to X, never to A or B once the test began, which in my opinion, is pointless.

In the first 2 tests, yes.

After 15 seconds you will forget what A is and what B is, and the whole test becomes useless.

But this is what I was doing in the ABX... which wasn't pointless.
 

solderdude

Grand Contributor
Joined
Jul 21, 2018
Messages
15,999
Likes
36,215
Location
The Neitherlands
I certainly didn't consciously approach any of the tests like that. As I said, I'd never done any listening tests beforehand, so had no preconceived ideas of how to approach them.

For me, it was simply: does X sound more like the A I just heard, or the B I just heard? That's all.

Well... I'd have to agree that AB, ABX tests and fully blind tests are hard to do when the differences are small, let alone non-existent.
About the latter... I have been fooled a few times into hearing differences where, after the test, it appeared there were no actual differences.
Also the other way around... when there were pretty measurable differences I could not tell them apart.
And, I might add, I have done blind (level-matched) tests and audibility tests for over 20 years now. Occasionally, that is; I find them hard to do when differences are really small or when I have no reference.

In any case... tests 1 and 2 did not seem to justify the thinking that the differences were 'clearly audible'. That series casts doubt on audibility.
But ABX is basically ABA or ABB. In the ABA case you hear a difference between A and B, and again from B to A.
In the ABB case you hear a difference going to B and then no difference.

This is no different from XXXXX where ABB or ABA are followed. This was not consistent in the first 2 attempts, so I see no reason to believe ABXABX is so much better than ABXXXXXXX.

Resetting the brain is important with all listening tests.

So... if something had been bugging me for 3 years I would have tried to find out. The best way would be to repeat the test (using someone you can trust) and try again, spread over a few days. When getting tired just stop and continue later.
 

AudioStudies

Addicted to Fun and Learning
Joined
May 3, 2020
Messages
718
Likes
400
Seems like a good time to mention the songs "Take the A train" and "I've Got a Ticket to Ride"
 