# Could someone help me to think through my ABX result using Bayesian reasoning?

#### MarkS

##### Senior Member
There was more data.
There were 30 attempts but the first 20 were ignored by Mani.

The funny thing about this story is that the first 20 attempts were valid, but it seems that data has not been properly analyzed.
Mani stated that he could hear what he thought/claims to hear for a long time and reliably.
Remember, this is at his home using familiar equipment and with music chosen by him.

The reasoning used by Mani that only the last 10 attempts are valid (and he would like to ignore the one he got wrong) baffles me, but I understand his reasoning.
This is not valid reasoning. You must state the rules of the test beforehand, and then follow them to the letter. The rules must state the conditions under which a trial will be discarded, and the "keep or discard" decision cannot be made AFTER the listener has made the AB choice in a particular trial. E.g., if the doorbell rings during a trial, it is OK to discard it BEFORE the listener has chosen A or B. But if, after making the choice, the listener says, "you know what, I was kinda fatigued, let's drop that one": no. Not allowed.

OP

#### manisandher

##### Senior Member
Let’s not turn this on its head. Let’s first without a shadow of doubt agree that you actually can hear a difference. If so, we can figure out why.

Well, what does without a shadow of a doubt mean exactly?

It's usually taken as p=0.05 (i.e., Type I errors below 5%.)

I got a result of p=0.01 in the ABX. A result of p=0.0133 if we take all 3 tests (but count the correct flips from A to B in tests 1 and 2 as a positive response - see analysis above). p=0.05 if we just sum the correct responses in all 3 tests.

What would p need to be for people to agree without a shadow of doubt? If I hadn't made a mistake on #9, it would have been p=0.001. Would that have convinced anyone here? No, of course not. Because apparently, it's not only the p number that matters, but the number of samples too.

So, how many samples would be OK? 100? Well, if I got 68/100, that'd be the same probability as 9/10. Would that convince anyone? If so, why would that be more convincing than 9/10? (In any event, I wouldn't do 100. That'd be listening to 300 samples. No way.)

30x ABX might be doable, over 3 sittings. If I were to get 25/30, p=0.0001, would that be good enough for people to agree without a shadow of doubt that I really am hearing a difference?
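The p-values quoted in this post can be checked with a few lines of Python. This is a minimal sketch, assuming each trial is an independent 50/50 guess under the null hypothesis and that the figure of interest is the one-sided tail (this many correct or more). The function name `abx_p_value` is made up for illustration.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` right in `trials` pure guesses."""
    # Count the outcomes with `correct` or more hits among all 2**trials outcomes.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


for correct, trials in [(9, 10), (10, 10), (25, 30), (68, 100)]:
    print(f"{correct}/{trials}: p = {abx_p_value(correct, trials):.5f}")
```

Under these assumptions 9/10 gives about 0.011, 10/10 about 0.001 and 25/30 about 0.00016. The 68/100 case comes out well below the 9/10 figure; a score of around 62/100 is closer to p = 0.01.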

#### Raindog123

##### Major Contributor
Forum Donor
Look, we could have chosen another variable, e.g., USB cables. But I went for a variable the effect of which I was very familiar. Any of the 100s of regular XXHighEnd users would instantly recognise my description of the piano attack sounding 'soft' with a larger SFS, and 'sharp' with a low SFS.

And yet the SFS setting does not change the bits in any way, shape or form.

It’s been mentioned here multiple times - being “bit-identical” is not sufficient for a line that carries _both_ data and clock. Unless you two verified that the clock spectra were identical between your A and B runs, this clock variation is a possible hypothesis of the difference you potentially heard (ie an assumption requiring further investigation).

However, such a hypothesis only applies to your described configuration, where the S/PDIF clock stability of the sourcing PC (an active device!) can potentially be affected by this “SFS” setting… To expect the same [clock jitter] effect from, e.g., replacing a passive USB cable is an entirely different story. So, let’s be careful about piling everything into one “I can hear a difference between anything and everything”.


#### mansr

##### Major Contributor
Well, what does without a shadow of a doubt mean exactly?

It's usually taken as p=0.05
Are you really that naive?

OP

#### manisandher

##### Senior Member
The reasoning used by Mani that only the last 10 attempts are valid (and he would like to ignore the one he got wrong) baffles me, but I understand his reasoning.

Oh, I missed this. Not once have I said that. Not once.

I said that I was surprised that I got #9 wrong, but that I was tired by that point. That's all. I've always included that miss.

OP

#### manisandher

##### Senior Member
Are you really that naive?

So tell me, what does without a shadow of doubt mean to you. Come on.

#### JRS

##### Addicted to Fun and Learning
I routinely hear a difference in my system all the time. Some days it sounds really great and I congratulate myself on having reached end-game excellence. Some days it sounds OK. Some days I find myself looking through reviews and wondering about an upgrade.

This all happens without making any change to the system at all. I’ve seen an avowed subjectivist write about how he experiences the same phenomenon.

“Hearing differences” happens. Personally I wonder whether medium-term changes in the tone of the tensor tympani muscle are altering the response profile of the middle ear, but research on this is very sparse.
Interesting hypothesis, as we have all had experiences as described. While it might be tempting to ascribe these perceptual changes to mood, I don't believe that accounts for it entirely. There are of course two muscles that influence the transfer function of the ossicular chain, the tensor tympani and the stapedius, both of which serve to decouple the tympanic membrane from the inner ear (according to some sources, only the stapedius functions as such in humans) and prevent damage to the hair cells, which of course is why alcohol inebriation (which relaxes the stapedius) and high volumes are a bad mix. I've seen papers that measure the transfer function in freshly dead humans and living cats, but I am unaware of any that look at it as a function of muscular tone. The other muscle that may affect transmission is the tensor veli palatini, which helps to open the eustachian tube ("popping the eardrums" to equalize pressure between the middle ear and ambient pressure). In any event there is a lot going on well before we get to "mood," much of which we are unaware of.

If the muscles are still in good shape, it would seem that electrical stimulation and contraction might afford a good look at the phenomenon. There may be such studies but available only in abstract. And that is a subject I could rant about for hours.

#### MarkS

##### Senior Member
@MarkS , thanks so much for sharing.

So am I correct in thinking that d is pretty much just the opposite of h? I.e.,

h = fraction where listener hears a difference and responds correctly
d = fraction where listener hears a difference and responds incorrectly

So, if we assume h=0, then:
1-h = all correct responses are guesses
1-d = all incorrect responses are guesses

So basically, 50/50.

(Apologies if I'm being a bit thick.)

Mani.
No, this is not correct.

d and h are categorically different numbers.

h is the fraction of trials in which you can actually hear a difference (over a long run of very many trials under the same conditions), and it has an actual value in the real world; we just don't know what that value is. h is an "objective" number. With enough trials (thousands), we could estimate the true value of h with high confidence.

d is how likely someone thinks it is that h is not zero before doing any trials. d is a "subjective" number. Many here (me included) would assign a very small value to d. Each person gets to choose their own value of d.

The utility of Bayesian analysis is that it quantifies by how much your beliefs should change in the light of new data. If d starts out very small, then after 9 correct answers in 10 trials, d would increase by a factor of 18.5.

More trials is much better. After 27 correct answers out of 30, d would increase by a factor of 17,000.

My input value of d would be 0.001 (1 in a thousand), allowing for some glitch somewhere in your system that really does produce an audible difference. 27/30 would raise d above 1 (not actually possible, I'm just using an approximate formula here) and would convince me that you are very likely hearing something real, given my personal prior of d=0.001 going in.

But 9/10 raises my d to just 0.0185, so about a 2% chance that you are hearing something real.

And, as many have said, if there is something that can be heard, then there is absolutely something that can be measured. Measurements are FAR more sensitive than hearing, by many orders of magnitude.
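The factors of 18.5 and 17,000 can be reproduced with a short calculation. This is a minimal sketch, assuming that if a real difference exists, h is equally likely to be anywhere between 0 and 1, so the chance of a correct answer, (1 + h)/2, is uniform between 0.5 and 1. That assumption is not stated in this post; it is chosen here because it reproduces the quoted numbers. The function names are made up for illustration.

```python
from fractions import Fraction
from math import comb


def bayes_factor(correct: int, trials: int) -> float:
    """Likelihood of the score if h > 0 (uniform h), relative to pure guessing."""
    wrong = trials - correct
    # Integrate p**correct * (1 - p)**wrong over p in [0.5, 1], term by term,
    # after expanding (1 - p)**wrong with the binomial theorem.
    integral = Fraction(0)
    for j in range(wrong + 1):
        power = correct + j + 1
        term = (1 - Fraction(1, 2) ** power) / power
        integral += Fraction((-1) ** j * comb(wrong, j)) * term
    # The density of p is 2 on [0.5, 1]; pure guessing gives (1/2)**trials.
    return float(2 * integral * 2 ** trials)


def posterior(prior: float, factor: float) -> float:
    """Exact Bayes update; close to prior * factor while that product is small."""
    return prior * factor / (prior * factor + 1 - prior)


for correct, trials in [(9, 10), (27, 30)]:
    factor = bayes_factor(correct, trials)
    print(f"{correct}/{trials}: factor {factor:.1f}, "
          f"d = 0.001 becomes {posterior(0.001, factor):.3f}")
```

With these assumptions the factor is about 18.5 for 9/10 and about 17,000 for 27/30. The exact update keeps d below 1: a prior of 0.001 becomes roughly 0.018 after 9/10 and roughly 0.94 after 27/30.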

OP

#### manisandher

##### Senior Member
No, this is not correct.

d and h are categorically different numbers.

h is the fraction of trials in which you can actually hear a difference (over a long run of very many trials under the same conditions), and it has an actual value in the real world; we just don't know what that value is. h is an "objective" number. With enough trials (thousands), we could estimate the true value of h with high confidence.

d is how likely someone thinks it is that h is not zero before doing any trials. d is a "subjective" number. Many here (me included) would assign a very small value to d. Each person gets to choose their own value of d.

The utility of Bayesian analysis is that it quantifies by how much your beliefs should change in the light of new data. If d starts out very small, then after 9 correct answers in 10 trials, d would increase by a factor of 18.5.

More trials is much better. After 27 correct answers out of 30, d would increase by a factor of 17,000.

My input value of d would be 0.001 (1 in a thousand), allowing for some glitch somewhere in your system that really does produce an audible difference. 27/30 would raise d above 1 (not actually possible, I'm just using an approximate formula here) and would convince me that you are very likely hearing something real, given my personal prior of d=0.001 going in.

But 9/10 raises my d to just 0.0185, so about a 2% chance that you are hearing something real.

And, as many have said, if there is something that can be heard, then there is absolutely something that can be measured. Measurements are FAR more sensitive than hearing, by many orders of magnitude.

Thanks.

So, 25/30 would be good to aim for were I to repeat the test (without knowing the exact factor by which this would increase your d=0.001 input value).

I'll take some measurements and do some listening over the Christmas break. Depending on how things go, I might set up another test. And if I do, I'll lay out the exact protocol and method here for the experts to comment on beforehand.

#### solderdude

##### Grand Contributor
Oh, I missed this. Not once have I said that. Not once.

I said that I was surprised that I got #9 wrong, but that I was tired by that point. That's all. I've always included that miss.

I know but the misses before were kind of dismissed. The 9th in the second you had an excuse for so basically for you only the first 8 attempts in the 3rd run you like to see as valid and evidence was my point. The 10th being correct (while tired and missing the 9th because of it) thus should also be dismissed as you were more tired than at the 9th attempt (arguably) and thus should not be in the 'I clearly heard it' range.
So not 9 out of 10 but 8 out of 8 and dismissing other valid attempts is what seems to be a more logical approach if you want to validate your hearing abilities.

So when adding the first 2 runs the first ABX would have been valid to you (they weren't practice runs) as real ABX + 10 attempts using ABX you could also say you got 10 out of 12.
I know nothing about math, but it is probably worse than 9 out of 10, and more honest to yourself when you claim lack of reference being the culprit.
Still not bad but, again, those into the math thing can probably tell you how big the chance is you would get the same results throwing a coin 12 times in a row and hitting this score.
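The coin-toss question for 10 out of 12 can be answered by brute force. This is a minimal sketch, assuming every trial is an independent fair coin flip and that "hitting this score" means 10 or more correct; the function name is made up for illustration.

```python
from itertools import product


def chance_by_enumeration(min_correct: int, trials: int) -> tuple[int, int]:
    """Count the coin-toss sequences with at least `min_correct` hits."""
    # Each sequence of 0s (miss) and 1s (hit) is equally likely when guessing.
    hits = sum(
        1 for seq in product((0, 1), repeat=trials) if sum(seq) >= min_correct
    )
    return hits, 2 ** trials


hits, total = chance_by_enumeration(10, 12)
print(f"{hits} of {total} sequences, probability {hits / total:.4f}")
```

Of the 4096 possible sequences, 79 contain 10 or more hits, so pure guessing reaches that score a little under 2% of the time, compared with about 1% for 9 out of 10.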

Good to read you are contemplating another test with more attempts. Kudos for at least considering it.

#### MarkS

##### Senior Member
Thanks.

So, 25/30 would be good to aim for were I to repeat the test (without knowing the exact factor by which this would increase your d=0.001 input value).

I'll take some measurements and do some listening over the Christmas break. Depending on how things go, I might set up another test. And if I do, I'll lay out the exact protocol and method here for the experts to comment on beforehand.
You shouldn't be "aiming for" anything. You should be listening carefully, without knowing the source, and writing down your answer as to whether it's A or B.

Ideally, this would continue for the total number of trials without you ever knowing the result of any trial before they are all completed. Then and only then are the results "unblinded", and you see how many you got right.

Also, the total number of trials should be decided upon in advance, and not changed afterward.

If you don't follow these "best practices", allowance must be made for that in the analysis, and it considerably weakens the results. For example, if you decide to do 20 trials, find you got 12/20, and then decide to do another 10: that's cheating!!! It changes the significance of the result, because if you had got 18/20, you would have stopped. This skews the actual results towards looking better than they would be with the number of trials fixed in advance.
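The effect of adding trials after peeking can be put into numbers. This is a minimal sketch with a hypothetical stopping rule that is not taken from the thread: do 20 trials and stop if the score is 15 or better (p below 0.05 on its own); otherwise do 10 more and count 20 or better out of 30 as a pass (also p below 0.05 on its own). It computes how often a pure guesser passes one way or the other.

```python
from math import comb


def pmf(k: int, n: int) -> float:
    """Chance of exactly k hits in n fair guesses."""
    return comb(n, k) / 2 ** n


def tail(k: int, n: int) -> float:
    """Chance of k or more hits in n fair guesses (0 if k > n)."""
    return sum(pmf(i, n) for i in range(k, n + 1))


def false_positive_rate(first: int = 20, extra: int = 10,
                        pass_first: int = 15, pass_total: int = 20) -> float:
    """Chance that a guesser passes under the peek-then-extend rule."""
    # Pass straight away after the first block.
    rate = tail(pass_first, first)
    # Otherwise extend the test and pass on the combined score.
    for k in range(pass_first):
        rate += pmf(k, first) * tail(pass_total - k, extra)
    return rate


print(f"20 trials only:    {tail(15, 20):.4f}")
print(f"30 trials only:    {tail(20, 30):.4f}")
print(f"peek, then extend: {false_positive_rate():.4f}")
```

Each fixed-length test on its own keeps the false-positive rate under 5%, but the combined rule lets a guesser pass more often than either test alone, which is the skew described above.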

#### MRC01

##### Major Contributor
... And yet the SFS setting does not change the bits in any way, shape or form.
However, if "loading" the PC means CPU usage, it could change the DAC's reference voltages or intersample timing (jitter).

#### mansr

##### Major Contributor
However, if "loading" the PC means CPU usage, it could change the DAC's reference voltages or intersample timing (jitter).
That's extremely unlikely.

#### JRS

##### Addicted to Fun and Learning
However, if "loading" the PC means CPU usage, it could change the DAC's reference voltages or intersample timing (jitter).
I have no wish to go back to the forum page, but that, I believe, is exactly what is meant. On that page and others there are discussions of file size (and the now famous Split File Size), along with overclocking, core usage and more, well beyond my knowledge of CPUs, and also well beyond what I understand the resource requirements of simple PCM playback to be. Like, what the hell do odd numbers under 10 do when split file size is being used, which apparently have audible SQ effects in SFS-on mode? Clearly, there is a lot going on under the hood of this player, and it causes me to doubt the identical bit playback claimed. But I am assuming that the claim has been verified, so I am cognitively stuck. Which is why I look forward to the USB cable experiment.

I'll suggest again that published claims such as these should be accompanied by a detailed "materials and methods" section, so everything is crystal clear at the outset. I recall a similar thread where it took a while to discover that the playback level matching between three amps on two sets of speakers was done by matching SPLs at 105dB. I believe all were in agreement that the matching should have been done by voltage.


#### mansr

##### Major Contributor
I have no wish to go back to the forum page, but that, I believe, is exactly what is meant. On that page and others there are discussions of file size (and the now famous Split File Size), along with overclocking, core usage and more, well beyond my knowledge of CPUs, and also well beyond what I understand the resource requirements of simple PCM playback to be. Like, what the hell do odd numbers under 10 do when split file size is being used, which apparently have audible SQ effects in SFS-on mode? Clearly, there is a lot going on under the hood of this player, and it causes me to doubt the identical bit playback claimed.
The digital input to the DAC was captured, and aside from the first few samples that the recording device failed to capture, everything was identical. I checked it myself. Any disruption in the output would have resulted in easily detected glitches in the captures.

Sound cards are unaffected by CPU load as long as samples are available when needed. If the CPU can't keep up, very obvious glitches result. There are no in betweens.

#### Raindog123

##### Major Contributor
Forum Donor
Sound cards are unaffected by CPU load as long as samples are available when needed. If the CPU can't keep up, very obvious glitches result. There are no in betweens

You’ll be surprised…

And what’s the alternative - Mani can read your mind? You tell us, you were there.


#### Blumlein 88

##### Grand Contributor
Forum Donor
The digital input to the DAC was captured, and aside from the first few samples that the recording device failed to capture, everything was identical. I checked it myself. Any disruption in the output would have resulted in easily detected glitches in the captures.

Sound cards are unaffected by CPU load as long as samples are available when needed. If the CPU can't keep up, very obvious glitches result. There are no in betweens.
Not sure about that. I had an old M Audio card. If it were playing music jitter was a bit high. If it were playing music while doing video in the background or other intense tasks somehow the jitter went higher and higher the more the CPU did. The computer had a gamer high wattage power supply.

I also had another machine with a built-in sound card where one channel was jitter-sensitive to the load on the rest of the machine, and one channel was not affected. So some such oddities aren't impossible, but more or less should be.

#### Raindog123

##### Major Contributor
Forum Donor

I am sure you‘ve heard of covert information channels. And the Tempest spec…

But you’re avoiding my question - 9/10 can’t simply be ignored, and if confirmed by the test rerun what else explains it?


#### voodooless

##### Major Contributor
Forum Donor
I also had another machine with a built-in sound card where one channel was jitter-sensitive to the load on the rest of the machine, and one channel was not affected. So some such oddities aren't impossible, but more or less should be.
That is not how jitter works. You cannot have this in one channel only. How do you even know it was jitter?
