For @bobbooo, who keeps bringing the topic up in other review threads.
Thanks! I was about to create a thread myself actually. It was just two review threads, by the way. Both were in response to related comments from other members - the first asking Amir directly if he could do null difference tests using DeltaWave (which also computes the Df metric), and the second a discussion about whether the high measured SINAD (and other standard metrics) of the Okto dac8 make it the 'best DAC in the world'. My point there was that this could not be determined without quantifying the DAC's performance when playing real music (or a close analogue thereof, in the form of an already standardized test signal developed to have similar spectral content to music, i.e. Program Simulation Noise). But I agree those two threads were getting overloaded with this discussion, which deserves its own thread. (Unfortunately it seems the previous thread on the topic devolved into ad hominem and only tangentially related arguments, so I think a fresh start is warranted.)
The Df values on Serge's site aren't referenced to electrical values, SPL or psychoacoustic metrics. They are self-contained and only map null results relative to each other. "Total sound degradation" measured this way is an abstraction.
The electrical setup is very specific. There would have to be a lot of work put into a standard set of tests to explain the Df metric. It would be time-consuming to say the least.
For example, I looked up Program Simulation Noise, which apparently is soft-clipped filter-shaped pink noise. What device behaviour is causing the Questyle QP1R to react to that signal? Is it clipping? Is the DAC filter inadequate?
What is the Df between Program Simulation Noise and regular pink noise? What is the Df between pink noise or white noise generated by different sources? [Edit: I meant between different digital sources. Because pink/white noise is generated according to a probability function, there will be differences in the signal despite it sounding the same.] What about the Df for the same signal, with one copy attenuated by 10dBFS?
Apparently the Program Simulation Noise Df for the Chord Hugo 2 is -32.6dB, the best result, and is -25.6dB for the FiiO M11, out of eleven DUTs. What does that ~7dB range mean?
Unless I've misunderstood, the Df "sound signature" can't be mapped audibly and it can't be used to diagnose engineering problems.
I'll repost my answers from the previous threads for reference as I've gone over some of this before:
It's the Df metric Serge of SoundExpert uses, i.e. the ratio of the RMS level of the difference signal to the RMS level of the original test signal, as defined in his AES paper.
They're [SINAD and the Df metric] both effectively signal-to-'noise' ratios (in the most general sense of 'noise' as unwanted sound), or a 'noise'-to-signal ratio in the case of the Df metric. The latter just seems a more generalized version of SINAD to me, taking into account all signal degradation instead of just THD+noise (where 'noise' means actual noise in the narrow sense), and applicable to any input signal, including real music.
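For anyone who wants the arithmetic spelled out, here's a minimal sketch of that definition (my own illustration, not SoundExpert's or DeltaWave's actual code; it assumes the two signals are already time- and level-aligned):

```python
# Minimal sketch of the Df idea described above: the RMS level of the
# difference signal relative to the RMS level of the original test signal,
# expressed in dB (more negative = smaller difference = less degradation).
import numpy as np

def rms(x):
    """Root-mean-square level of a signal."""
    return np.sqrt(np.mean(np.square(x)))

def df_db(reference, captured):
    """Df in dB, assuming the signals are already time- and level-aligned."""
    return 20.0 * np.log10(rms(captured - reference) / rms(reference))

# Example: a 1 kHz sine plus a little noise gives a strongly negative Df.
fs = 48000
t = np.arange(fs) / fs
ref = np.sin(2 * np.pi * 1000 * t)
cap = ref + 1e-3 * np.random.default_rng(0).standard_normal(fs)
print(round(df_db(ref, cap), 1))  # around -57 dB for this amount of noise
```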
So if the Df metric is an abstraction, SINAD is just as much so. Serge does actually specify reference levels for playback tests - the same maximum level recommended by the EN 50332-2 standard (see his full methodology at the bottom of this page). It could be argued this is an arbitrary choice, but the same could be said about the level used for SINAD measurements.

Note: the EN 50332-2 level was chosen by Serge because he was initially interested in testing portable devices, which the standard was made for. The standard also specifies testing with 32 ohm loads, I presume to simulate an average pair of headphones. (As almost all modern portable players follow this standard these days, just setting them to max volume will usually yield the same 150mV level across the 32 ohm loads, as Serge's testing diagram prescribes.) But this does not mean the Df metric could not also be used for larger DACs/amps intended for speaker playback - standard test levels just need to be chosen (again, just like for SINAD), and the 32 ohm loads would not be needed. Testing portable players without loads as well would give their line-out performance anyway, so that would still be useful and easily obtainable data if adding loads proved too time-consuming (although that would really just be a one-time soldering job).
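(As a back-of-envelope check of what that reference level actually amounts to - assuming the 150mV figure is an RMS value - it's well under a milliwatt into each load:)

```python
# Rough power implied by the EN 50332-2-style reference level mentioned above:
# 150 mV (assumed RMS) across a 32 ohm load.
V, R = 0.150, 32.0
print(f"{(V ** 2 / R) * 1e3:.2f} mW")  # ~0.70 mW
```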
I presume you meant what device behaviour is causing the FiiO M11 (not the Questyle QP1R) to 'react' badly to the Program Simulation Noise (hereafter PSN), yet not the sine signal? (The inverse is true for the QP1R.) I think that's the beauty of the Df metric in a way - it highlights all possible sound degradation when playing actual (or simulated) music, some of which we may not currently know the cause or mechanism of. In the FiiO M11's case I can't imagine it's clipping, otherwise I would have thought the sine Df would also be adversely affected, no? Maybe it's some kind of as-yet-unknown nonlinear effect due to the complexity of the PSN/music waveform, who knows. Of course, from an engineering perspective this would be useful to know, but for ranking sound degradation (what the Df metric was intended for), not really - all that matters are the correlations between the inputs and outputs; everything else can be a black box for that purpose.
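To make 'soft-clipped, filter-shaped pink noise' a bit more concrete, here's a very rough stand-in generator - to be clear, this is my own sketch and not the actual Program Simulation Noise defined in the standard (which specifies the exact filter shape and clipping); the 1/sqrt(f) spectral shaping, the 100Hz-8kHz band-pass and the tanh clipper here are all assumptions purely for illustration:

```python
# Very rough PSN-like signal: pink-ish noise -> band-shaping filter -> soft clip.
# NOT the standardized Program Simulation Noise, just an illustrative stand-in.
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
fs = 48000
n = 5 * fs  # 5 seconds

# Pink-ish noise via 1/sqrt(f) spectral shaping of white noise (power ~ 1/f).
spectrum = np.fft.rfft(rng.standard_normal(n))
freqs = np.fft.rfftfreq(n, 1 / fs)
scale = np.ones_like(freqs)
scale[1:] = 1.0 / np.sqrt(freqs[1:])
pink = np.fft.irfft(spectrum * scale, n)

# Assumed band-shaping filter (the real PSN filter is defined in the standard).
sos = signal.butter(2, [100, 8000], btype="bandpass", fs=fs, output="sos")
shaped = signal.sosfilt(sos, pink)

# Illustrative soft clipper standing in for the standard's clipping stage.
shaped /= np.max(np.abs(shaped))
psn_like = np.tanh(2.0 * shaped)
```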
To your point about generated pink noise (and so the PSN) not being identical due to probability functions, this can easily be overcome by just using the same identical source file for all tests, such as the ones pre-generated and included in Audio Precision's Audio Player Test Utility. As for the Df between the PSN and pink noise, that could be determined by generating the PSN from a known pink noise file, saving both, and running them through DeltaWave to compute the Df. (Not exactly sure why you want to know this value though.) Linear level (as well as time shift) differences are adjusted for in the Df computation, so two signals whose only difference is a 10dBFS attenuation would have identical Df values.
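To illustrate that last point, here's a small sketch (my own, using a simple least-squares gain fit as a stand-in for whatever level alignment DeltaWave actually performs) showing that a pure 10dB level offset contributes essentially nothing to the residual once linear gain is matched out:

```python
# A -10 dB copy of a signal (plus a trace of noise) nulls almost perfectly
# once the linear gain difference is fitted out before differencing.
import numpy as np

rng = np.random.default_rng(0)

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def df_db(reference, test):
    return 20.0 * np.log10(rms(test - reference) / rms(reference))

fs = 48000
t = np.arange(fs) / fs
reference = np.sin(2 * np.pi * 1000 * t)
captured = 10 ** (-10 / 20) * reference + 1e-5 * rng.standard_normal(fs)

# Naive difference: dominated by the level offset.
print(round(df_db(reference, captured), 1))        # about -3 dB

# Least-squares gain match, then difference: only the tiny noise remains.
g = np.dot(reference, captured) / np.dot(captured, captured)
print(round(df_db(reference, g * captured), 1))    # about -87 dB
```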
Can you say what a 7dB difference in SINAD really means intuitively? That's not that easy for me, and I don't think you can get a feel for that for Df values either until a large number and range of devices have been measured. What you can see (and hear) using DeltaWave that is intuitive to understand is the actual difference signal between the original and the recorded sound file produced by any device, which is quite fascinating. If you listen to a difference signal of real music and turn your headphones/speakers up, you can actually still hear the form of the original music, and by comparing these difference signals between devices you can hear the differences in level and noise across DUTs, directly listening to the degradation each device imparts on your music.

Thinking about this, it may be possible to work out a limit on perceptible relative Df values between devices, for example by ABXing difference signals of ever-closer Df value until they can no longer be distinguished. Of course this doesn't take into account perceptual masking when listening to real music, but it could be a useful hard lower limit at which it can be safely said that two devices with Df values closer than this limit will have comparable levels of degradation to your ears (the limit could even be individual, depending on your performance in the ABX test). In a similar way, a hard lower limit on absolute Df value audibility, and so pretty much guaranteed transparency with music, could be determined by ABXing difference signals of ever-decreasing Df value against digital silence, ending up with whatever the Df equivalent of the often-quoted ~120dB limit on SINAD audibility turns out to be.
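For anyone who wants to try that kind of listening, here's a rough sketch of how a difference signal could be exported for audition/ABX (my own illustration; 'ref.wav' and 'cap.wav' are hypothetical filenames, and it assumes the reference and the capture are already time- and level-aligned, e.g. as exported from DeltaWave):

```python
# Export an audible 'difference signal' (the residual a device adds/changes),
# boosted so it can be heard comfortably when auditioned or ABX'd.
import numpy as np
import soundfile as sf  # pip install soundfile

ref, fs = sf.read("ref.wav")     # hypothetical aligned reference file
cap, fs2 = sf.read("cap.wav")    # hypothetical aligned loop-back capture
assert fs == fs2 and ref.shape == cap.shape

diff = cap - ref
gain_db = 40.0                            # boost purely for audibility
boosted = diff * 10 ** (gain_db / 20)
boosted = np.clip(boosted, -1.0, 1.0)     # avoid clipping the output file

sf.write("difference_plus40dB.wav", boosted, fs)
```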
Additionally, Serge is also attempting to quantify the correlation between null difference measurements and listening test results here, which has shown some promising results. Is there strong quantifiable evidence for the correlation between SINAD and listening tests?
This AES paper by Steve Temme and Sean Olive doesn't sound too promising (my emphasis):
In summary, there appears to be some moderate positive correlations between the amount of THD measured in the headphones and their sound quality rating.
In summary, among the distortion metrics we chose in this study, non-coherent distortion based on music appears to be more correlated with listeners’ preference ratings than the THD, IM and Multitone.
Finally, this study provides further experimental evidence that traditional nonlinear distortion measurements are not particularly useful at predicting how good or bad a high caliber headphone sounds.
I personally see the Df metric as being most useful as an objective, pure measure of audio signal degradation though, i.e. a natural extension and expansion of SINAD that encompasses all unwanted changes in the electrical audio chain and uses real (or simulated) music instead of test tones, for a more accurate relation to real-world listening.