PK Error Metric discussion and beta-test

pkane · Jan 27, 2021

As many of you know, I created the DeltaWave app for null-type difference comparisons between two audio files: a reference and a comparison.

I'd like to propose a new error metric that I just added to DeltaWave. Due to a complete lack of product naming skills, I'm calling it the PK Error Metric.

The goal is to compute a difference result that more directly answers the question of whether the difference between two devices is likely to be audible or not, and also helps pinpoint where the differences are likely to be more audible in the recording (may help pass ABX tests). It's certainly not the final word in perceptual metrics by any stretch of imagination. It'll likely evolve over time, in no small part based on the feedback from you. Oh, and if there's a metric already published that does exactly this, I'm happy to change the name to the correct one.

DeltaWave produces a number of charts and metrics to help find and explain any differences between two files, to measure if the differences are relatively small or large. A few current metrics are somewhat perceptually weighted, such as the A-Weighted RMS of the error (delta) file, but most are simply engineering numbers that are good for evaluating equipment, measuring relative differences, determining the cause of a specific imperfection, etc. Since many (most?) people use DeltaWave looking for audible differences, the lack of a more perceptually-accurate metric has been on my mind for a while.

What is PK Metric?
The metric itself is either a single RMS number or a chart, expressed in dB over time. While this is similar to the error signal in time domain, PK Metric is computed in the frequency domain over multiple, overlapping time windows of 400ms each. Here's what this looks like in the final result:

For each 400ms time window, an STFT (Short-Time Fourier Transform) is performed for both the reference and the comparison. The spectra of the two windows are corrected using equal loudness curves. This step uses an interpolated version of the ISO 226:2003 curves and is designed to adjust the frequency weights according to their audibility at the current playback level. The level is estimated from the sound energy computed for each of the two windows being compared. For example, a 20kHz frequency will be weighted a whole lot less than 3kHz. I've thought of but not implemented a lower-threshold cutoff filter... Maybe something to discuss.

The adjusted frequency responses are then reduced using an ERB (Equivalent Rectangular Bandwidth) smoothing filter, and finally, each of the ERB buckets in the comparison is subtracted from the same ERB bucket in the reference. The energy of the resulting error spectrum is then summed to produce a single dB value representing how loud the error is at that particular 400ms interval. Because the original file samples are in dBFS, the PK Error Metric result is also in dBFS. An option is provided to also display the result in dBr, relative to the total energy level of the window. dBFS value tells you how loud the error signal is relative to the maximum sample value, while dBr tells you how loud it is relative to the actual signal (music, for example).
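The steps above can be sketched for a single pair of windows as follows. This is a rough illustration, not DeltaWave's code: A-weighting stands in for the level-dependent ISO 226:2003 curves (which need the standard's tables), the ERB bands use the Glasberg and Moore ERB-rate approximation, and the Hann window, the 8kHz sample rate, the subtraction of band amplitudes rather than band powers, and every function name are assumptions made for the example.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def a_weight_gain(f):
    """Linear A-weighting gain: a fixed stand-in (assumption) for ISO 226 curves."""
    f2 = f * f
    ra = (12194.0 ** 2 * f2 * f2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return ra * 10 ** (2.0 / 20)  # the +2 dB offset puts the gain near 1.0 at 1 kHz

def erb_band(f):
    """ERB-rate band index (Glasberg and Moore approximation) for f in Hz."""
    return int(21.4 * math.log10(1.0 + 0.00437 * f))

def band_amplitudes(window, fs, n_fft=4096):
    """Hann window -> FFT -> perceptual weighting -> amplitude per ERB band."""
    n = len(window)
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    padded = [s * w for s, w in zip(window, hann)] + [0.0] * (n_fft - n)
    spec = fft(padded)
    # Parseval scaling: converts one-sided bin power to mean-square signal level
    scale = 2.0 / (n_fft * sum(w * w for w in hann))
    power = {}
    for k in range(1, n_fft // 2):
        f = k * fs / n_fft
        p = scale * abs(spec[k]) ** 2 * a_weight_gain(f) ** 2
        band = erb_band(f)
        power[band] = power.get(band, 0.0) + p
    return {band: math.sqrt(p) for band, p in power.items()}

def pk_style_error_db(ref_window, cmp_window, fs):
    """Energy of the per-band amplitude difference, in dB re full scale."""
    ref_bands = band_amplitudes(ref_window, fs)
    cmp_bands = band_amplitudes(cmp_window, fs)
    err = sum((ref_bands[b] - cmp_bands[b]) ** 2 for b in ref_bands)
    return 10 * math.log10(max(err, 1e-30))

def sine(freq, amp, phase, fs, n):
    return [amp * math.sin(2 * math.pi * freq * i / fs + phase) for i in range(n)]

FS = 8000          # low sample rate keeps the pure-Python FFT quick
N = int(0.4 * FS)  # one 400 ms window = 3200 samples

ref = sine(1000, 0.5, 0.0, FS, N)
shifted = sine(1000, 0.5, 1.0, FS, N)   # same tone, different time offset
quieter = sine(1000, 0.45, 0.0, FS, N)  # same tone, about 0.9 dB lower

print("time-shifted tone:", round(pk_style_error_db(ref, shifted, FS), 1), "dB")
print("quieter tone:     ", round(pk_style_error_db(ref, quieter, FS), 1), "dB")
```

Under these assumptions the time-shifted tone yields an error far below that of the level-shifted tone, because only band magnitudes are compared and the phase is discarded.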

What's different?

Versus the RMS difference value already reported by DeltaWave, PK Metric represents a much more perceptually-weighted result. In many cases, it will be closer to the A-weighted difference (dBA) than to the RMS value, but in a lot of cases, it will be different there, as well.

Unlike the error waveform plot, PK Metric is much easier to read. For example, in the plot above it's pretty obvious that the error will be more audible just before the 7 second mark, and then, again, around the 11 second mark. Here's what the error waveform looks like for the above test. Try to interpret this for audibility, and you may well pick wrong!

Unlike the DF Metric proposed by @Serge Smirnoff, PK Metric is not confused by small errors in timing or noise. For example, here are two 1kHz sine waves at different time offsets, at the same level, showing the delta waveform and the RMS null values. The naive RMS difference is very large, -9dBFS, since these waveforms are not aligned in the time domain. This may be mistaken for a very large, audible difference. And yet, the two sine waves at the same level and frequency really should sound the same!
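A small sketch of why a time offset alone produces a large time-domain null. The level and the 100 microsecond delay are arbitrary choices for illustration and are not the settings behind the plot in the post. The closed form follows from the identity sin(a) - sin(b) = 2·cos((a+b)/2)·sin((a-b)/2), so the delta of two equal sines is itself a sine of amplitude 2·A·|sin(phi/2)|.

```python
import math

def rms_dbfs(samples):
    """RMS level in dB relative to a full-scale amplitude of 1.0."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(max(mean_square, 1e-30))

FS = 44100       # sample rate in Hz
FREQ = 1000.0    # tone frequency in Hz
AMP = 0.5        # arbitrary level (assumption)
DELAY = 100e-6   # arbitrary 100 microsecond offset (assumption)

n = FS  # one second, which is a whole number of cycles at 1 kHz
ref = [AMP * math.sin(2 * math.pi * FREQ * i / FS) for i in range(n)]
cmp_ = [AMP * math.sin(2 * math.pi * FREQ * (i / FS - DELAY)) for i in range(n)]
delta = [a - b for a, b in zip(ref, cmp_)]

# Closed form: delta amplitude is 2*A*|sin(phi/2)|, RMS is amplitude / sqrt(2)
phi = 2 * math.pi * FREQ * DELAY
predicted = 20 * math.log10(2 * AMP * abs(math.sin(phi / 2)) / math.sqrt(2))

print(f"tone level: {rms_dbfs(ref):.2f} dBFS")
print(f"RMS null:   {rms_dbfs(delta):.2f} dBFS (closed form {predicted:.2f} dBFS)")
```

Even this small offset leaves a null only a few dB below the tone itself, although the two tones have identical level and frequency.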

Here's the PK Metric, doing the comparisons in the perceptually-weighted frequency domain. The result is nearly -160dB, and therefore completely inaudible:

Another example: the original 44.1kHz reference WAV file was low-pass filtered at 19kHz in Audacity. The comparison looks like this (RMS null of only -48dB):

And its perceptually-weighted PK Metric is at -72dB, about 24dB down from the RMS null difference, indicating this is unlikely to be audible:

This result is much closer to what one would expect, since the test signal is recorded music with not much happening above 19kHz and, of course, human hearing is not sensitive at those frequencies.

Is PK Metric useful?
That's something I'm hoping to hear from you. I'd welcome your thoughts on the design, any ideas for improvements, and any of your own test results. The Gearslutz DA/AD loopback thread is a great source for WAV file comparisons. However, the numbers reported in that thread are the RMS of the error signal. They are useful as an engineering metric, but a little misleading if you want to know whether the differences between two DACs or other components are really audible. My hope is that PK Error Metric can help answer these audibility questions a little better.

All feedback and suggestions are welcome!

MC_RME · Jan 27, 2021

Bravo! Bravo! Bravo! Waited for this since Diffmaker was abused to indicate issues where there had been none...

Blumlein 88 · Jan 27, 2021

So it happens in the digital world too. I've been known to keep items for more than 20 years. Then decide, "well, guess if I've not needed it by now I never will." Almost invariably and improbably within two weeks of getting rid of it I'll need it. Maybe in the digital realm it is three weeks. Because 3 weeks ago I looked at all the files I'd used in the early days of testing DeltaWave, differencing DACs, cables etc., decided I'd not need them again and deleted them. I do still have posted the original and 8th generation copy files.

pkane · Jan 27, 2021

MC_RME said:
Bravo! Bravo! Bravo! Waited for this since Diffmaker was abused to indicate issues where there had been none...

Yes, here's an example using an RME device. On the Gearslutz DA/AD thread, @Archimago shared his RME ADI-2 Pro FS loopback recording. The RMS error was reported as -44.3dBFS for the left channel:

FS version, unspecified setting (Archimago): 0.2 dB (L), 0.2 dB (R), -44.3 dBFS (L), -45.4 dBFS (R)

That's not that great for a high-quality device with objectively low noise and low distortion. Let's see what PK Metric reports for the same loopback file:

-81dBFSPK (new units). Almost 2x lower on the dB scale (about 37dB further down), and below audibility.

pkane · Jan 27, 2021

Blumlein 88 said:
So it happens in the digital world too. I've been known to keep items for more than 20 years. Then decide, "well, guess if I've not needed it by now I never will." Almost invariably and improbably within two weeks of getting rid of it I'll need it. Maybe in the digital realm it is three weeks. Because 3 weeks ago I looked at all the files I'd used in the early days of testing DeltaWave, differencing DACs, cables etc., decided I'd not need them again and deleted them. I do still have posted the original and 8th generation copy files.

I lost most of my test files in a disk crash that took out a RAID array last year. Have been trying to rebuild the collection ever since

pozz · Jan 27, 2021

One thing I'm not clear about for the PKEM is whether the dB values are supposed to track playback level. So in your example of the -48dB RMS null of the 44.1 file vs. the PKEM of -72dB, am I supposed to treat the values the same way as I would a 0dBFS signal with distortion at -72dB?

Would it be clearer just to have a 0-1 correlation scale (1 being maximally clean)?

Edit: Never mind, you answered this:

The adjusted frequency responses are then reduced using an ERB (Equivalent Rectangular Bandwidth) smoothing filter, and finally, each of the ERB buckets in the comparison is subtracted from the same ERB bucket in the reference. The energy of the resulting error spectrum is then summed to produce a single dB value representing how loud the error is at that particular 400ms interval. Because the original file samples are in dBFS, the PK Error Metric result is also in dBFS. An option is provided to also display the result in dBr, relative to the total energy level of the window. dBFS value tells you how loud the error signal is relative to the maximum sample value, while dBr tells you how loud it is relative to the actual signal (music, for example).

pkane · Jan 27, 2021

pozz said:
One thing I'm not clear about for the PKEM is whether the dB values are supposed to track playback level. So in your example of the -48dB RMS null of the 44.1 file vs. the PKEM of -72dB, am I supposed to treat the values the same way as I would a 0dBFS signal with distortion at -72dB?

Would it be clearer just to have a 0-1 correlation scale (1 being maximally clean)?

There are two display options in DeltaWave: dBFS or dBr. dBFS is the level relative to 0dBFS. dBr is the level relative to the signal; in this case, a -72dBr error level would indicate that the error is 72dB below the signal itself.
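The relationship between the two display options is a simple subtraction in dB. A minimal sketch, assuming energies are normalised so that full scale equals 1.0; the function names and the example levels are made up for illustration and are not taken from DeltaWave.

```python
import math

def energy_to_db(energy, floor=1e-30):
    """Convert an energy (power) ratio to dB, guarding against log(0)."""
    return 10 * math.log10(max(energy, floor))

def error_levels(error_energy, signal_energy):
    """Return (dBFS, dBr) for one window, with full-scale energy taken as 1.0."""
    dbfs = energy_to_db(error_energy)            # relative to full scale
    dbr = dbfs - energy_to_db(signal_energy)     # relative to the signal itself
    return dbfs, dbr

# Hypothetical window: music at -20 dBFS with an error at -72 dBFS
signal_energy = 10 ** (-20 / 10)
error_energy = 10 ** (-72 / 10)
dbfs, dbr = error_levels(error_energy, signal_energy)
print(f"{dbfs:.1f} dBFS, {dbr:.1f} dBr")
```

In this hypothetical window the same error reads -72 dBFS but only -52 dBr, since the music itself sits 20dB below full scale.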

pozz · Jan 27, 2021

pkane said:
For each 400ms time window, an STFT (Short-Time Fourier Transform) is performed for both the reference and the comparison. The spectra of the two windows are corrected using equal loudness curves.

pkane said:
The adjusted frequency responses are then reduced using an ERB (Equivalent Rectangular Bandwidth) smoothing filter

If I understand correctly, you have a reference signal and a comparison signal with some sort of nonlinearity applied. You then use ERB-based smoothing to reduce the energy of both before comparing. This is the perceptual weighting part.

pkane said:
The level is estimated from the sound energy computed for each of the two windows being compared.

Do you mind explaining how you estimate the level?

For example, Rnonlin is computed for 80dB SPL and then has weightings applied for lower (but not higher) playback levels. (Don't ask me how it works because I haven't figured the details out.)

pozz · Jan 27, 2021

For the estimation of sound energy, you are not estimating it per ERB filter, correct?

Blumlein 88 · Jan 27, 2021

pkane said:
I lost most of my test files in a disk crash that took out a RAID array last year. Have been trying to rebuild the collection ever since

Losing a RAID array............ouch!

Blumlein 88 · Jan 27, 2021

Okay, here is one that is obviously audible. Used a ribbon microphone to record music 1.25 meters from my speakers. This is the result comparing the recording to the original. -24 dB on the PK metric. What is the threshold for audibility we are assuming here?

And the same using a condenser microphone. This one has much flatter response, and yet the result is not better according to the metric. I'd judge it closer to the original than the ribbon version.

pkane · Jan 27, 2021

Blumlein 88 said:
Okay, here is one that is obviously audible. Used a ribbon microphone to record music 1.25 meters from my speakers. This is the result comparing the recording to the original. -24 dB on the PK metric. What is the threshold for audibility we are assuming here?

And the same using a condenser microphone. This one has much flatter response, and yet the result is not better according to the metric. I'd judge it closer to the original than the ribbon version.

I'd say -50dB is the obvious limit. Anything below that would be questionable.

pkane · Jan 27, 2021

pozz said:
If I understand correctly, you have a reference signal and a comparison signal with some sort of nonlinearity applied. You then use ERB-based smoothing to reduce the energy of both before comparing. This is the perceptual weighting part.

Do you mind explaining how you estimate the level?

For example, Rnonlin is computed for 80dB SPL and then has weightings applied for lower (but not higher) playback levels. (Don't ask me how it works because I haven't figured the details out.)

There are two perceptual filters applied. First, an equal loudness curve to both, source and comparison windows. Then, ERB smoothing and a difference. "Estimated" was perhaps the wrong choice of words, it's computed from the total energy contained in the window after the perceptual filters are applied and the difference is computed.

Blumlein 88 · Jan 27, 2021

Just for comparison, here are the results comparing a Cranesong HEDD ADC and a Prism Dream ADC. Certainly inaudible to me.

bobbooo · Jan 27, 2021

pkane said:
Unlike the DF Metric proposed by @Serge Smirnoff, PK Metric is not confused by small errors in timing or noise.

In what way is it confused? As I linked to in this post, I thought Serge's DF Metric software accounts for this via an algorithm which finds the global minimum Df value for any input/output pair by iterating over all possible phase/time shifts, up to arbitrary accuracy (currently 0.0001 dB), only limited by how much processing power and time you have. I believe you incorporated (or at least approximated) this process in the DF calculation in DeltaWave? If so and the DF Metric already accounts for linear timing shifts, what advantage does the PK Metric have over it in this regard?

Blumlein 88 · Jan 27, 2021

Here is the original vs one of my 8th generation copies. An edge case at 50 dB. I could ABX it in certain small areas of the file. It was unclear if others could or not. As a group the responses to this were 50/50. The area I concentrated on to ABX it in foobar was 13-15 seconds.

Here is the same with phase and EQ correction. I had reason to believe the audibility of this was due to the rippled FR in the top couple octaves.

pkane · Jan 27, 2021

bobbooo said:
In what way is it confused? As I linked to in this post, I thought Serge's DF Metric software accounts for this via an algorithm which finds the global minimum Df value for any input/output pair by iterating over all possible phase/time shifts, up to arbitrary accuracy (currently 0.0001 dB), only limited by how much processing power and time you have. I believe you incorporated (or at least approximated) this process in the DF calculation in DeltaWave? If so and the DF Metric already accounts for linear timing shifts, what advantage does the PK Metric have over it in this regard?

DF metric of two different white-noise files, for example, will report a huge error. PK Metric will not, because it's using frequency domain for error calculation and the spectra are similar.
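The effect can be illustrated without either metric. The sketch below does not compute DF or PK Metric: it uses a plain sample-by-sample null as the time-domain comparison and eight equal-width frequency bands with no perceptual weighting as the spectral comparison. The noise level, seed, window length and band count are arbitrary assumptions.

```python
import cmath
import math
import random

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def band_amplitudes(x, n_bands=8):
    """Amplitude in n_bands equal-width bands (DC and Nyquist bins skipped)."""
    n = len(x)
    half = n // 2
    spec = fft(x)
    power = [0.0] * n_bands
    for k in range(1, half):
        band = (k * n_bands) // half
        power[band] += 2 * abs(spec[k]) ** 2 / (n * n)  # Parseval scaling
    return [math.sqrt(p) for p in power]

def db(energy):
    return 10 * math.log10(max(energy, 1e-30))

rng = random.Random(1)  # fixed seed so the run is repeatable
N = 4096
noise_a = [rng.gauss(0, 0.1) for _ in range(N)]
noise_b = [rng.gauss(0, 0.1) for _ in range(N)]  # independent second "file"

signal_energy = sum(s * s for s in noise_a) / N
null_energy = sum((a - b) ** 2 for a, b in zip(noise_a, noise_b)) / N
band_error = sum(
    (p - q) ** 2 for p, q in zip(band_amplitudes(noise_a), band_amplitudes(noise_b))
)

null_rel = db(null_energy) - db(signal_energy)
band_rel = db(band_error) - db(signal_energy)
print(f"time-domain null vs signal:     {null_rel:+.1f} dB")
print(f"band-amplitude error vs signal: {band_rel:+.1f} dB")
```

The sample-by-sample null of two independent noise files comes out louder than either file, while the band-amplitude error is far lower because both files have nearly the same energy in every band.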

bobbooo · Jan 27, 2021

pkane said:
DF metric of two different white-noise files, for example, will report a huge error. PK Metric will not, because it's using frequency domain for error calculation and the spectra are similar.

Isn't a potential problem with this, that the frequency spectra don't uniquely determine the files? A trivial, extreme example would be one track being identical to another but played backwards. Also, does it really matter that the DF Metric doesn't work well for white noise? After all, I thought one of the main advantages of these metrics was to be able to determine audio degradation of the actual music we listen to, instead of just test signals.

pkane · Jan 27, 2021

bobbooo said:
Isn't a potential problem with this, that the frequency spectra don't uniquely determine the files? An extreme example would be one track being identical to another but played backwards. Also, does it really matter that the DF Metric doesn't work well for white noise? After all, I thought one of the main advantages of these metrics was to be able to determine audio degradation of the actual music we listen to, instead of just test signals.

A reversed file will certainly not have the same spectral content as the original from PK Metric perspective. Remember, the comparison is done on overlapping 400ms time-domain windows.
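A toy illustration of this point, using a two-tone signal and a single-frequency DFT in place of a full spectrum. The signal, the 8kHz sample rate and the helper names are assumptions chosen to keep the example small.

```python
import cmath
import math

FS = 8000
WINDOW = int(0.4 * FS)  # 400 ms

def tone(freq, seconds):
    return [math.sin(2 * math.pi * freq * i / FS) for i in range(int(seconds * FS))]

def tone_amplitude(samples, freq):
    """Amplitude of the component at freq, from a single-frequency DFT."""
    acc = sum(
        s * cmath.exp(-2j * math.pi * freq * i / FS) for i, s in enumerate(samples)
    )
    return 2 * abs(acc) / len(samples)

original = tone(500, 1.0) + tone(2000, 1.0)  # 500 Hz for 1 s, then 2 kHz for 1 s
backwards = original[::-1]

# Whole-file view: time reversal leaves the magnitude at any frequency unchanged
whole_orig = tone_amplitude(original, 500)
whole_back = tone_amplitude(backwards, 500)

# Windowed view: the first 400 ms of each file contain different material
first_orig = tone_amplitude(original[:WINDOW], 500)
first_back = tone_amplitude(backwards[:WINDOW], 500)

print(f"whole file, 500 Hz:   {whole_orig:.3f} vs {whole_back:.3f}")
print(f"first window, 500 Hz: {first_orig:.3f} vs {first_back:.3f}")
```

Over the whole file the 500Hz magnitude is the same in both directions, but the first 400ms window holds the 500Hz tone in the original and the 2kHz tone in the reversed copy, so a window-by-window comparison sees them as different.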

The main difference between DF metric and PK metric is that DF is an engineering metric, while PK is a perceptually-weighted one. DF reports all differences, regardless of whether they are audible or not. PK Metric, by design, incorporates audibility filters to reduce the influence of errors that are unlikely to be audible to a human.

There's a place for both, and in fact, both metrics are available in DeltaWave today, so you can compare them.

Blumlein 88 · Jan 27, 2021

Here is the 1st generation copy of the same track I posted earlier, which I could never ABX successfully. The difference on the 8th gen copy was -36 dB RMS and is -49 dB RMS on this first generation copy. The PK metric went from -50.8 to -69.9 dB.

With level EQ it jumps to more than 100 dB.
