Statistics of ABX Testing

Blumlein 88 · Apr 3, 2016

fas42 said:
I never worry about volume - because it makes zero difference when listening for defects: a mechanic doesn't move closer and further away from a car that's making a funny noise to "adjust the volume" - it's a yes/no decision as to whether there's a problem.

My gripe about so much of this testing is the use of the word "preference". To me, the sound is either 'right' or 'wrong', it's very clear cut - dud system A, dud system B, two unpleasantnesses - I would "prefer" to switch both off, and go outside ...

So you're a digital listener then. Everything is GOOD or BAD, no in between. I guess you bypass all volume controls on all gear. The item is either on or off. Volume makes zero difference; there is either volume or no volume. I guess Fletcher-Munson doesn't apply to you. You can evaluate the relative frequency balance of a recording whether it is played back at 30 dB or 130 dB.

Your mechanic story is pretty funny too. I once worked on 24-cylinder methane-powered engines. There were exhaust pipes, but in the building next to the engines it was around 115-120 dB. If you simply listened to that, even briefly, you could hear nothing useful. Put on some ear muffs, knock that noise down 25 dB, and you could hear details. You could hear a wrist pin making noise, or valves that needed adjusting, or even that the timing was a bit off. So even your goofy mechanic example holds no water.

March Audio · Apr 3, 2016

fas42 said:
I never worry about volume - because it makes zero difference when listening for defects: .

Sorry fas but this couldn't be more wrong, unless of course you are referring to the most gross and obvious of defects.

Your hearing sensitivity changes with volume.

https://en.m.wikipedia.org/wiki/Equal-loudness_contour

AJ Soundfield · Apr 3, 2016

BE718 said:
Sorry fas but this couldn't be more wrong

Would you expect anything less from Frank? It's part of the gag

John Kenny · Apr 3, 2016

amirm said:
We don't. So we rely on tests that qualify categories of products. Our job is to indicate how unlikely it is for some product to make a difference/improvement. How far we go and how much one relies on that is up to the person. Let's get the data on the table as there are a lot more blind tests than we even know about.

So can you give (put on the table) examples of "tests that qualify categories of products" & how we might use it to make a judgement about a specific device we might be considering?

fas42 · Apr 3, 2016

Blumlein 88 said:
So you're a digital listener then. Everything is GOOD or BAD no in between. I guess you bypass all volume controls on all gear. The item is either on or off.

Strangely enough, I have no volume control at the moment. More correctly, I do, but I run it at the end of the pot track, because I've changed the gain setting circuitry so that this means the pot is effectively out of the picture. Why? Because it's a crap, cheap Alps pot - at some stage I shall experiment with decent parts, to see if I can get "transparency" there.

Blumlein 88 said:
Your mechanic story is pretty funny too. I once worked on 24-cylinder methane-powered engines. There were exhaust pipes, but in the building next to the engines it was around 115-120 dB. If you simply listened to that, even briefly, you could hear nothing useful. Put on some ear muffs, knock that noise down 25 dB, and you could hear details. You could hear a wrist pin making noise, or valves that needed adjusting, or even that the timing was a bit off. So even your goofy mechanic example holds no water.

Strangely enough, conventional cars don't have noise abatement issues - so we're not talking silly examples here. The point is that you listen for something being wrong, that's the idea - if I hear nothing wrong, or I'm not in a fussy, investigative mood, then it's 'right'; but if I hear problems in just normal listening, or when I deliberately stress the system by putting on a very complex, treble-infused track at high volume, then it's 'wrong'.

fas42 · Apr 3, 2016

BE718 said:
Sorry fas but this couldn't be more wrong, unless of course you are referring to the most gross and obvious of defects.

Your hearing sensitivity changes with volume.

https://en.m.wikipedia.org/wiki/Equal-loudness_contour

To me, those defects are "obvious", because I very deliberately focus on listening for them. So I can hear them over a very wide volume range - and once I've zoomed in and noticed something then my consciousness has no problem "staying" with that artifact. What I will do is adjust the volume to see if the issue varies with level - usually a sign that a power supply is not optimum.

March Audio · Apr 3, 2016

fas42 said:
To me, those defects are "obvious", because I very deliberately focus on listening for them. So I can hear them over a very wide volume range - and once I've zoomed in and noticed something then my consciousness has no problem "staying" with that artifact. What I will do is adjust the volume to see if the issue varies with level - usually a sign that a power supply is not optimum.

Please provide evidence of that, beyond your subjective opinion. It is very likely to be the changes in hearing with level as described above.

fas42 · Apr 3, 2016

Probably the clearest instance of that was the battleship Perreaux amp I started this journey with - it had a very noticeable issue in that the level of distortion of the treble was highly dependent on the output level - a pretty normal behaviour for many older era amps, of course. There was a clear point in the acoustic level where the cymbal splash started to go dead in the sound, the harmonics just fell off the cliff - whether I was close to the speakers or far away made no difference. And, no, it was not the speakers - it was highly dependent on the material, a piece that had a massive treble transient after some softer material was fine - the power supply had enough charge storage to handle this type of demand.

Down the track I did major surgery to upgrade the energy storage of this amp, and then those problems went away. With current, decent amps those types of issues are far less prevalent.

Blumlein 88 · Apr 3, 2016

fas42 said:
Strangely enough, I have no volume control at the moment. More correctly, I do, but I run it at the end of the pot track, because I've changed the gain setting circuitry so that this means the pot is effectively out of the picture. Why? Because it's a crap, cheap Alps pot - at some stage I shall experiment with decent parts, to see if I can get "transparency" there.

Strangely enough, conventional cars don't have noise abatement issues - so we're not talking silly examples here. The point is that you listen for something being wrong, that's the idea - if I hear nothing wrong, or I'm not in a fussy, investigative mood, then it's 'right'; but if I hear problems in just normal listening, or when I deliberately stress the system by putting on a very complex, treble-infused track at high volume, then it's 'wrong'.

So now you might use a complex track at high volume to find a problem after saying volume did not matter.

fas42 · Apr 3, 2016

Blumlein 88 said:
So now you might use a complex track at high volume to find a problem after saying volume did not matter.

Because I'm using "volume" to tickle out problems, stressing the system to encourage it to misbehave - issues arise for various reasons, some because the power supplies are not sufficient, others because the components are not robust enough against interference - and the "tactics" are different.

The principle is to ascertain whether the system will behave itself under all conditions, rather than just rely on a series of standard, relatively static tests.

Blumlein 88 · Apr 3, 2016

fas42 said:
Because I'm using "volume" to tickle out problems, stressing the system to encourage it to misbehave - issues arise for various reasons, some because the power supplies are not sufficient, others because the components are not robust enough against interference - and the "tactics" are different.

The principle is to ascertain whether the system will behave itself under all conditions, rather than just rely on a series of standard, relatively static tests.

Which contradicts your previous statements about volume matching not being important. If you compare two devices or parts, the fact that they might act differently at various volume levels is one reason to match. The other is that you hear differently at different volumes, which necessarily affects your ability to discern, whether you understand that or not.

fas42 · Apr 3, 2016

You misunderstand. I'm using volume purely to elicit bad behaviour from a single system, not to compare things. I have no interest in comparing, at the moment - I just want to eradicate all flaws that cause audible artifacts in one particular system.

If I had two separate systems that differed and both had no audible issues, then I would be interested in comparing the sorts of things you have in mind - but I haven't reached that point of having two setups on the ground to try that; something for further down the track.

I'm certainly aware that the raw quality of a component of a system can shine through, even though the rig may still have issues - at times I hear a 'superior' quality in another person's set of equipment which I don't usually get, but that doesn't bother me ... my "shtick" is to get the system in front of me to work to the best of its inherent ability.

Blumlein 88 · Apr 3, 2016

If you elicit bad behavior and then eradicate it, you are comparing two different conditions, though at different times. That approach only works for very large problems. There are much better, more consistent, and finer levels at which to work.

Jakob1863 · Jul 21, 2016

amirm said:
Statistics of ABX Testing
By Amir Majidimehr

In the table below, I have computed the answer for 10, 20, 40, 80 and 160 trials:

There is a problem with the binom.inv function (or maybe with the description of the function, which struggles with "at most" versus "less than"): the number it returns apparently has to be read as "greater than", so 9 correct answers are needed instead of 8 (SL = 0.05), 15 instead of 14, and so on.
For example, the cumulative probability P(X<8) = 0.9453 and therefore P(X>=8) = 0.0547, so slightly above the line.
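
The off-by-one described here can be checked with a short Python sketch. It assumes pure guessing (p = 1/2) and assumes the spreadsheet function follows a quantile-style definition (the smallest k whose cumulative probability reaches the criterion). The helper names `upper_tail`, `quantile`, and `min_correct` are hypothetical and not part of any library; exact fractions are used so that no rounding can blur the comparison.

```python
from fractions import Fraction
from math import comb

ALPHA = Fraction(5, 100)  # significance level, SL = 0.05


def upper_tail(n, k):
    """Exact P(X >= k) for X ~ Binomial(n, 1/2), i.e. pure guessing."""
    return Fraction(sum(comb(n, i) for i in range(k, n + 1)), 2**n)


def quantile(n, level):
    """Smallest k with P(X <= k) >= level (assumed quantile-style definition)."""
    for k in range(n + 1):
        if 1 - upper_tail(n, k + 1) >= level:
            return k


def min_correct(n, alpha):
    """Smallest number of correct answers k with P(X >= k) <= alpha."""
    for k in range(n + 2):
        if upper_tail(n, k) <= alpha:
            return k


for n in (10, 20, 40, 80, 160):
    q = quantile(n, 1 - ALPHA)
    k = min_correct(n, ALPHA)
    print(f"n={n}: quantile={q}, needed={k}, P(X>={k})={float(upper_tail(n, k)):.4f}")
```

For 10 and 20 trials this gives quantiles of 8 and 14 but required counts of 9 and 15, matching the numbers quoted above. Under the assumed definition the required count is always the quantile plus one.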

Phelonious Ponk · Jul 21, 2016

Blumlein 88 said:
If you elicit bad behavior and then eradicate it, you are comparing two different conditions, though at different times. That approach only works for very large problems. There are much better, more consistent, and finer levels at which to work.

You're wasting your time. You're not really even talking to Frank, you're talking to the voices in his head. It couldn't be a more futile endeavor.

Tim

fas42 · Jul 22, 2016

Tim, you ol' rascal you, can't keep you down, can we now? In Tim's world, there are things that you can hear ... and everything else is nonsense, right? Dearie, me ...

As someone who can't even hear that listening at 720p on YouTube makes a difference, you have zero credibility in terms of being able to distinguish audible variations in sound, I'm afraid.

amirm · Jul 22, 2016

Jakob1863 said:
There is a problem with the binom.inv function (or maybe with the description of the function, which struggles with "at most" versus "less than"): the number it returns apparently has to be read as "greater than", so 9 correct answers are needed instead of 8 (SL = 0.05), 15 instead of 14, and so on.
For example, the cumulative probability P(X<8) = 0.9453 and therefore P(X>=8) = 0.0547, so slightly above the line.

Welcome to the forum. And yes, there is rounding error there, but as I said, there is no magic in 95% that stops being so at 94.5%.

Jakob1863 · Jul 22, 2016

amirm said:
Welcome to the forum. And yes, there is rounding error there, but as I said, there is no magic in 95% that stops being so at 94.5%.

Thank you very much for the welcome.
It seems to be a systematic error, as the function always returns (that is, for the ~20 numbers I've tried) a number that is one count too low, so the description of the function is misleading. We are looking for the number of successes with a cumulative probability of >= 0.95 (at least 0.95), while binom.inv apparently delivers the number of successes with a cumulative probability of <= 0.95 (at most 0.95).

Nevertheless, you are absolutely right: there is no magic in the usual criteria, hence it is often better to report the p-values and let the reader decide whether 5.x% constitutes an unbearable risk while 4.x% does not (for example).
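
As a rough illustration of reporting the p-value instead of a pass/fail verdict, here is a minimal Python sketch. It assumes a one-sided test against pure guessing (p = 1/2); `abx_p_value` is a hypothetical helper name, not a library function.

```python
from math import comb


def abx_p_value(correct, trials):
    """One-sided p-value: P(X >= correct) for X ~ Binomial(trials, 1/2)."""
    hits = sum(comb(trials, i) for i in range(correct, trials + 1))
    return hits / 2**trials


for correct in (8, 9):
    print(f"{correct}/10 correct: p = {abx_p_value(correct, 10):.4f}")
# 8/10 correct: p = 0.0547
# 9/10 correct: p = 0.0107
```

Reported this way, 8 out of 10 is shown for what it is, a result just above the 5% line, and the reader can weigh it accordingly.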
