
PK Error Metric discussion and beta-test

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,524
Likes
37,057
Yes, here's an example using an RME device. On the Gearslutz DA/AD thread, @Archimago shared his RME ADI-2 Pro FS loopback recording. The RMS error was reported as -44.3dBFS for the left channel:



That's not that great for a high-quality device with objectively low noise and low distortion. Let's see what PK Metric reports for the same loopback file:
View attachment 108759

-81dBFSPK (new units :) ) Almost 2x lower, and below audibility.

Here are my results for the Antelope Audio Zen Tour on the Gearslutz material. It was originally reported as -56.9 dB on the left channel, which is 12.6 dB better than the RME result. On the PK Metric the tables are turned, with the RME scoring 9.1 dB better.

1611772260024.png
 
OP
pkane

Master Contributor
Forum Donor
Joined
Aug 18, 2017
Messages
5,632
Likes
10,206
Location
North-East
Isn't a potential problem with this, that the frequency spectra don't uniquely determine the files? A trivial, extreme example would be one track being identical to another but played backwards. Also, does it really matter that the DF Metric doesn't work well for white noise? After all, I thought one of the main advantages of these metrics was to be able to determine audio degradation of the actual music we listen to, instead of just test signals.

Wanted to share this earlier. DF Metric for the RME ADI-2 Pro FS loopback, the same one I posted earlier in the thread from Gearslutz, captured by Archimago:
1611775202491.png


The different error measurements are:

RMS Null metric (engineering): -44.3 dBFS
DF Metric (engineering): -32.3dBFS
PK Metric (perceptual): -81.4dBFS

All of these express the magnitude of the error signal, and all are reported by DeltaWave, so you can choose whichever you prefer. I have very little doubt that a loopback recording made with the RME ADI-2 Pro FS will not be audibly different from the original. To me, PK Metric represents a more realistic result when looking for audibility. But then, I'm biased ;)
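For anyone who wants to see what a plain sample-by-sample RMS null amounts to, here is a minimal sketch. It is not DeltaWave's code, and the function names and the test signal are made up for illustration. It only shows how a time-domain difference turns into a dBFS number, using a deliberately simple case where the "recording" is the original with a 1% level error.

```python
import math

def rms_dbfs(samples):
    """RMS level of a list of samples in dB relative to full scale (1.0)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def rms_null_dbfs(reference, recorded):
    """RMS level of the sample-by-sample difference between two aligned signals."""
    difference = [r - c for r, c in zip(reference, recorded)]
    return rms_dbfs(difference)

# Hypothetical test signal: 10 full cycles of a full-scale sine.
reference = [math.sin(2 * math.pi * 10 * n / 1000) for n in range(1000)]
# Hypothetical "recording": identical except for a 1% gain error.
recorded = [0.99 * s for s in reference]

null_db = rms_null_dbfs(reference, recorded)
print(f"RMS null: {null_db:.2f} dBFS")
```

A 1% gain error alone already limits this kind of null to roughly -43 dBFS, which is why small linear differences can dominate a time-domain number without saying much about audibility.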
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,524
Likes
37,057
Personally, I think you should remove the correlated null and put the PK metric in its place. Even didier on gearslutz has dropped the correlated null in his listings. I usually look at the difference and difference A-wtd.

I've noticed, using several files now, that if I use non-linear calibration with Level EQ and Phase checked, the difference nulls improve (sometimes by a large amount), but the PK metric improves only a little. Yet if I only do Level EQ while leaving Phase unchecked, the difference null may hardly improve, but the PK metric improves by a large amount (20 dB better). I'm not sure I see why that is happening. Any thoughts on that?
 
OP
pkane

Master Contributor
Forum Donor
Joined
Aug 18, 2017
Messages
5,632
Likes
10,206
Location
North-East
Personally, I think you should remove the correlated null and put the PK metric in its place. Even didier on gearslutz has dropped the correlated null in his listings. I usually look at the difference and difference A-wtd.

I've noticed, using several files now, that if I use non-linear calibration with Level EQ and Phase checked, the difference nulls improve (sometimes by a large amount), but the PK metric improves only a little. Yet if I only do Level EQ while leaving Phase unchecked, the difference null may hardly improve, but the PK metric improves by a large amount (20 dB better). I'm not sure I see why that is happening. Any thoughts on that?

Yeah, I find the correlated null value sometimes useful, but rarely accurate, so always have to double-check using other measurements when something strange is reported there.

Hard to know without seeing the actual files and your non-linear EQ settings. Sometimes the settings might be causing more harm than good, which is why I advise that they be used very carefully. A large FFT size and/or a short file, for example, can result in a much noisier correction due to fewer FFT windows being averaged. This can also sometimes add unintended distortion.
 
OP
pkane

Master Contributor
Forum Donor
Joined
Aug 18, 2017
Messages
5,632
Likes
10,206
Location
North-East
Here is a good example. Results of my 18i20 clocked by the Zen Tour and using the original2.wav from the gearslutz thread.

Here is the file. I assume you have the original2 file.
https://www.dropbox.com/s/b8sdifvorx9s1ol/18i20 zt clk lock.zip?dl=0
View attachment 108866

View attachment 108867

View attachment 108868

View attachment 108869

Thanks for the file, Dennis! The difference I see between the non-linear Level EQ and the Level + Phase EQ is that the result has a few larger amplitude errors when non-linear phase correction is included. This is likely why the PK Metric reports a worse value. It may have to do with noise or some other artifact in the recording. It's hard to tell with the non-linear corrections, since they use the entire spectrum as the correction value, so there's no single number or simple curve that tells a simple story. Here's the amplitude error with Level EQ only:

1611797860011.png



And here it is with level + phase correction:

1611797888513.png


You'll notice that in the level + phase plot, the error is mostly lower and more consistent, except for three very large spikes. These are the errors that phase non-linear EQ is introducing. It seems to be working well for most of the recording, except for those three spikes.

The phase error does appear a bit non-linear, though, so with proper non-linear correction I'd expect the result to improve:
1611798251052.png
 

MC_RME

Addicted to Fun and Learning
Technical Expert
Audio Company
Joined
May 15, 2019
Messages
855
Likes
3,566
From my former testing (long ago) there are multiple ways to get second generation files that are screwed up for analysis. Examples using the ADI-2 Pro in analog loopback:

- Setting input and output to the same reference level. This seems fully logical at first, but if you look closer (at the input level meters) you will find that today's music always includes intersample overs. These are only slight ISPs that cause zero harm on the DA side (it renders fully undistorted up to +2.5 dBFS, and undistorted within the audio band to more than +4 dBFS), but the AD side has a strict limit at the chosen ref level, so the recorded file includes hard clipping. Small peaks only, but that seemed to have thrown off DW a lot.

- You can set the input ref level one step higher; the recorded file is then 5 or 6 dB lower in level, which makes you lose analysis resolution. Or you reduce the level of the original file by this amount during playback. The latter makes you lose the original file as the reference for DW's comparison.

- The choice of AD and DA filters. AFAIR I had best results with the Sharp linear phase filter on both, but have to check this again. Default is minimum phase, though. Small phase shifts are inaudible but caused Diffmaker to present values that had no relation to audibility. DW worked much better (due to its included phase compensation feature), but the result still did not match audibility or give numbers that were somehow meaningful. That should have changed now.
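To illustrate the intersample-over point in the first item, here is a small sketch with a textbook worst case rather than real music. It assumes a sine at one quarter of the sample rate with a 45 degree phase offset, scaled so every sample sits exactly at full scale. Because the waveform is known analytically, the "true peak" is read from a 16x finer time grid instead of from a reconstruction filter, which is what a real true-peak meter or DAC would use. All names and constants are made up for the example.

```python
import math

AMPLITUDE = math.sqrt(2)   # chosen so the sampled values land exactly on +/-1.0
OVERSAMPLE = 16            # finer time grid used to find the analog peak
NUM_SAMPLES = 64

def waveform(t):
    """Continuous sine at fs/4 with a 45 degree offset; t is in sample periods."""
    return AMPLITUDE * math.sin(math.pi / 2 * t + math.pi / 4)

def peak_dbfs(values):
    """Peak absolute value in dB relative to full scale (1.0)."""
    return 20 * math.log10(max(abs(v) for v in values))

samples = [waveform(n) for n in range(NUM_SAMPLES)]
fine_grid = [waveform(k / OVERSAMPLE) for k in range(NUM_SAMPLES * OVERSAMPLE)]

sample_peak_db = peak_dbfs(samples)     # what a sample-peak meter shows
true_peak_db = peak_dbfs(fine_grid)     # what the reconstructed waveform reaches

print(f"sample peak: {sample_peak_db:+.2f} dBFS, true peak: {true_peak_db:+.2f} dBFS")
```

The file never exceeds 0 dBFS on the sample grid, yet the waveform between samples reaches about +3 dBFS. That is the kind of peak an AD input set to the same reference level would clip.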

Thanks, Paul, for this version. I'll be sure to play around with it some more.
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,524
Likes
37,057
Could you include a chart or graph of the ERB buckets? Maybe original, comparison, and a delta bucket chart for each ERB range. Even listing it as a numerical chart might be useful for seeing whether or not it gets tricked by certain situations, and for understanding which kinds of results are more audible than others. Being able to select a certain portion of the PK metric graph and see the ERB results for that selected section would be really useful. But that might be a huge chore to create.
 
OP
pkane

Master Contributor
Forum Donor
Joined
Aug 18, 2017
Messages
5,632
Likes
10,206
Location
North-East
Could you include a chart or graph of the ERB buckets? Maybe original, comparison, and a delta bucket chart for each ERB range. Even listing it as a numerical chart might be useful for seeing whether or not it gets tricked by certain situations, and for understanding which kinds of results are more audible than others. Being able to select a certain portion of the PK metric graph and see the ERB results for that selected section would be really useful. But that might be a huge chore to create.

I thought about something like that. Maybe more of a "real-time" graph that lets you step through each of the 400ms windows to see what went into computing the single number for that point. It will be a bit of work, but I'll see what I can do.
 

KSTR

Major Contributor
Joined
Sep 6, 2018
Messages
2,690
Likes
6,013
Location
Berlin, Germany
Just started playing with this... looks like a well thought out effort, excellent!

Obviously the overall result is highly dependent on match settings, as you are effectively analyzing the residual. So with everything full on (non-lin level and phase EQ) I'm getting a result in the -110dBr range for a match of original vs. loopback recording, whereas when it's off it's "only" 80dBr or so.

What I was about to ask: what influence do the various window choices and FFT sizes have? In general, any recommendations for the deepest match and cleanest residual? Gut feeling seems to suggest turning the FFT size up to max, and I like cosine windows, so I've selected cosine23 term for everything.
As DW kills some serious CPU time it's not that convenient to compare (though I just found out one can run multiple instances of it).
 
OP
pkane

Master Contributor
Forum Donor
Joined
Aug 18, 2017
Messages
5,632
Likes
10,206
Location
North-East
What I was about to ask: what influence do the various window choices and FFT sizes have? In general, any recommendations for the deepest match and cleanest residual? Gut feeling seems to suggest turning the FFT size up to max, and I like cosine windows, so I've selected cosine23 term for everything.
As DW kills some serious CPU time it's not that convenient to compare (though I just found out one can run multiple instances of it).

For PK Metric specifically, there's no influence from any of the FFT settings that normally apply to other DeltaWave features. The FFT size is automatically determined by the sampling rate and the 400ms window size. FFT Window is BlackmanHarris7, and currently not changeable.
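As a rough illustration of how the sampling rate and a 400ms window pin down the analysis length, here is a small sketch. The number of samples per window follows directly from those two values. Rounding up to the next power of two for the FFT is purely an assumption made for this example, since how DeltaWave actually picks the FFT size isn't stated here.

```python
def window_samples(sample_rate, window_ms=400):
    """Number of samples covered by one analysis window."""
    return round(sample_rate * window_ms / 1000)

def next_power_of_two(n):
    """Smallest power of two >= n (ASSUMED padding rule, not confirmed)."""
    return 1 << (n - 1).bit_length()

# Window length and (assumed) FFT size for a few common sample rates.
sizes = {}
for rate in (44100, 48000, 96000, 192000):
    samples = window_samples(rate)
    sizes[rate] = (samples, next_power_of_two(samples))
    print(f"{rate} Hz: {samples} samples per 400 ms window, FFT size {sizes[rate][1]}")
```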
 

DDF

Addicted to Fun and Learning
Joined
Dec 31, 2018
Messages
617
Likes
1,355
DF metric of two different white-noise files, for example, will report a huge error. PK Metric will not, because it's using frequency domain for error calculation and the spectra are similar.

Hi Paul, great work!
I'm trying to better understand the use case for PK-error. Is it to compare two files, one going into a black box and one coming out of it, i.e. to evaluate the time-variant, non-linear transfer of a DUT? If so, comparing two noise files should work reasonably well in the time domain if they were time-aligned first (using correlation). Just trying to understand your point that frequency domain analysis is needed.

I have a suggestion that could help improve the algorithm. Instead of using a power integration over a fixed 400ms window, have a look into using a window that represents the integration time of the auditory system to build up loudness perception. Back when I worked in Telecom, I came up with an algorithm that would measure the time duration of an echo burst, and then perceptually weight it: if the burst duration was less than the loudness integration time of the auditory system, it would lower its representation of the level of the burst. This allowed the echo cancellation system to then use less switched loss (also called center clipping or noise clipper) and significantly enhance the ability of the system to retain unclipped full duplex speech even in the presence of short echoes. We ran full DBTs on it to assess its ability to keep echo audibility below perceptual thresholds even if allowing through higher than typical bursts. The DBTs were standards compliant, statistically significant etc, same as Harman does. In fact I think the lead auditory psychologist that defined the tests also worked on one of Harman's tests in the 90s. Mentioning just to say I trust the tests. We saw decent correlations back, meaning the perceptual loudness integration time hook worked OK and had some value.

One last suggestion. Instead of generating your own perceptual metric, it might be worth looking into implementing a standardized approach like ITU's PEAQ (overview here). It has the benefit of independent industry benchmarking in DBTs. Maybe there's a public domain code base available?
 
OP
pkane

Master Contributor
Forum Donor
Joined
Aug 18, 2017
Messages
5,632
Likes
10,206
Location
North-East
Hi Paul, great work!
I'm trying to better understand the use case for PK-error. Is it to compare two files, one going into a black box and one coming out of it, i.e. to evaluate the time-variant, non-linear transfer of a DUT? If so, comparing two noise files should work reasonably well in the time domain if they were time-aligned first (using correlation). Just trying to understand your point that frequency domain analysis is needed.

The reason for frequency domain comparison is that our hearing is tuned to frequency discrimination as one of the primary mechanisms of identifying and differentiating sounds. Ears don't do sample by sample comparison in the time domain, which is what RMS error metric or DF metrics do.

I have a suggestion that could help improve the algorithm. Instead of using a power integration over a fixed 400ms window, have a look into using a window that represents the integration time of the auditory system to build up loudness perception.

That's actually where the 400ms value comes from. Standards-based loudness measurements, such as LUFS, for example, are based on 400ms windows. So was the DF Metric, which is why I picked it. If there's a better window size for loudness integration, I'd love to see some research on this.

One last suggestion. Instead of generating your own perceptual metric, it might be worth looking into implementing a standardized approach like ITU's PEAQ (overview here). It has the benefit of independent industry benchmarking in DBTs. Maybe there's a public domain code base available?

PEAQ is interesting in that it is similar to PK Metric, applying the same concepts, except for small differences in implementation. For example, I use the ISO 226:2003 equal loudness curves and interpolate between them to compute the frequency-related response, while PEAQ uses a simple functional approximation to the threshold of hearing. PEAQ appears to use the Bark scale while I use ERB. But it would be interesting to see a side-by-side comparison of the two. I wouldn't be surprised if they produce similar results; at least the designs appear to be similar. I'll dig more into PEAQ to see if there's anything significant I'm missing. Thanks for pointing it out!
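For a feel of how the two auditory scales compare, here is a short sketch using widely published closed-form approximations: the Glasberg and Moore (1990) ERB formulas and the Zwicker and Terhardt (1980) Bark formula. These are textbook approximations picked for illustration. They are not taken from the DeltaWave or PEAQ implementations, which may use different variants.

```python
import math

def erb_bandwidth_hz(f_hz):
    """Equivalent rectangular bandwidth at f_hz (Glasberg & Moore 1990)."""
    return 24.7 * (4.37 * f_hz / 1000 + 1)

def erb_number(f_hz):
    """Position on the ERB-rate scale (Glasberg & Moore 1990)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000 + 1)

def bark(f_hz):
    """Critical-band rate in Bark (Zwicker & Terhardt 1980 approximation)."""
    return 13 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500) ** 2)

for f in (100, 1000, 4000, 10000):
    print(f"{f:>6} Hz: ERB width {erb_bandwidth_hz(f):7.1f} Hz, "
          f"ERB number {erb_number(f):5.2f}, Bark {bark(f):5.2f}")
```

Both scales get coarser as frequency rises. The ERB scale simply slices the audio band into more, narrower bands than Bark does.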
 

DDF

Addicted to Fun and Learning
Joined
Dec 31, 2018
Messages
617
Likes
1,355
The reason for frequency domain comparison is that our hearing is tuned to frequency discrimination as one of the primary mechanisms of identifying and differentiating sounds. Ears don't do sample by sample comparison in the time domain, which is what RMS error metric or DF metrics do.

Frequency domain discrimination alone doesn't account for forward masking, or the much less effective backward masking. For example, take two different dynamic (e.g. signal burst) signals that result in the same 400ms average. What we hear will be very different if the first has all its energy lumped into the first 100ms of a 400ms window, vs. if it is comprised of two bursts: a lower-level incident at the front end of the average, then a gap (e.g. 200ms), followed by an echo or reflection. I think one unique and very powerful benefit of your metric resides in its ability to detect and uncover time-variant processes, intended or otherwise. Summing them into a 400ms average will significantly obscure that resolution.

That's actually where the 400ms value comes from. Standards-based loudness measurements, such as LUFS, for example, are based on 400ms windows. So was the DF Metric, which is why I picked it. If there's a better window size for loudness integration, I'd love to see some research on this.

Hmm, let me dig through my old notes. This was 25 years ago but, IIRC, it was frequency dependent and shorter than this. Audiolense, for example, uses a fixed cycle-count window for its in-room measurement algorithm for this reason, and REW and MLSSA support similar windows.

PEAQ is interesting ...Thanks for pointing it out!

You're welcome, hope it leads to something. My understanding of PEAQ is that its perceptual model is much more complicated than what we're talking about here.

I had one other idea to share. Way back when, I wrote a test algorithm similar to what you're doing, but with the goal of uncovering time-variant processes such as DSP signal-dependent gain, compression, etc. The basic idea was to take any music stimulus you wanted, use a steep notch filter, and then embed a tone in the notch whose amplitude was modulated by the signal envelope so as to minimize disruption to the signal envelope. You'd then run it through the DUT and compare output to stimulus using a steep band-pass filter, looking only at the tone. It uncovered a multitude of interesting signal-dependent non-linearities. I used it to reverse engineer the non-linear signal processing of a third-party device and presented it to the company. They threatened to sue, thinking I'd hacked their firmware. So, I'd say it worked. :) It's different from what you're doing with PK-error in that it's not a perceptual metric, but it's a very powerful addition to the toolbox for seeing a bit more directly what real non-linear processes are occurring in real time with any signal. I presented it to the IEEE and they were going to pursue it, but then I left the audio industry and I don't think it was picked up after I left.
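A toy version of the embedded-tone idea might look like the sketch below. Everything in it is a stand-in chosen for illustration: a 200 Hz sine plays the role of the music, the notch is a standard RBJ-cookbook biquad, the "DUT" is a hypothetical block-wise gain reducer, and a simple lock-in (correlation with a complex exponential) replaces the steep band-pass filter. It is not the original algorithm, just the same sequence of steps in miniature.

```python
import cmath
import math

FS = 8000          # sample rate in Hz (hypothetical)
BLOCK = 800        # 100 ms analysis blocks
TONE_HZ = 1000.0   # probe tone frequency, placed inside the notch
MUSIC_HZ = 200.0   # stand-in for the music stimulus

def notch(x, fs, f0, q=30.0):
    """RBJ-cookbook biquad notch centred on f0 (direct form I)."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b0, b1, b2 = 1.0, -2 * math.cos(w0), 1.0
    a0, a1, a2 = 1 + alpha, -2 * math.cos(w0), 1 - alpha
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for xn in x:
        yn = (b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1, y2, y1 = x1, xn, y1, yn
        y.append(yn)
    return y

def tone_level(x, fs, f0):
    """Lock-in estimate of the amplitude of the f0 component in x."""
    acc = sum(v * cmath.exp(-2j * math.pi * f0 * n / fs) for n, v in enumerate(x))
    return 2 * abs(acc) / len(x)

def blocks(x):
    """Split x into consecutive BLOCK-sample chunks."""
    return [x[i:i + BLOCK] for i in range(0, len(x), BLOCK)]

# "Music": loud for five blocks, then quiet for five blocks.
music = [(0.8 if n < 5 * BLOCK else 0.1) * math.sin(2 * math.pi * MUSIC_HZ * n / FS)
         for n in range(10 * BLOCK)]

# Probe tone whose amplitude follows the per-block peak envelope of the music.
envelope = [max(abs(v) for v in b) for b in blocks(music)]
tone = [0.05 * envelope[n // BLOCK] * math.sin(2 * math.pi * TONE_HZ * n / FS)
        for n in range(len(music))]
stimulus = [m + t for m, t in zip(notch(music, FS, TONE_HZ), tone)]

def dut(x):
    """Hypothetical device under test: halves the gain whenever a block is loud."""
    out = []
    for b in blocks(x):
        gain = 0.5 if max(abs(v) for v in b) > 0.5 else 1.0
        out.extend(gain * v for v in b)
    return out

output = dut(stimulus)

# Per-block gain seen by the probe tone: output tone level over stimulus tone level.
gains = [tone_level(o, FS, TONE_HZ) / tone_level(s, FS, TONE_HZ)
         for s, o in zip(blocks(stimulus), blocks(output))]
print([round(g, 3) for g in gains])
```

The per-block gain list traces the signal-dependent gain the made-up DUT applied, without needing to null the whole waveform.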
 
OP
pkane

Master Contributor
Forum Donor
Joined
Aug 18, 2017
Messages
5,632
Likes
10,206
Location
North-East
Frequency domain discrimination alone doesn't account for forward masking, or the much less effective backward masking. For example, take two different dynamic (e.g. signal burst) signals that result in the same 400ms average. What we hear will be very different if the first has all its energy lumped into the first 100ms of a 400ms window, vs. if it is comprised of two bursts: a lower-level incident at the front end of the average, then a gap (e.g. 200ms), followed by an echo or reflection. I think one unique and very powerful benefit of your metric resides in its ability to detect and uncover time-variant processes, intended or otherwise. Summing them into a 400ms average will significantly obscure that resolution.

PK Metric has a near-term (within 400ms) masking implicit in the window size, and frequency masking from ERB filtering. Time-masking, at least at the level of 100ms, is accounted for in the overlap of 400ms windows (windows are advanced forward in time 100ms at each step).
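To make the windowing concrete, here is a small sketch of 400ms windows advanced in 100ms steps. The function name and the 48 kHz, two-second example are made up for illustration, and this is only the framing arithmetic, not the metric itself.

```python
def frame_starts(num_samples, sample_rate, window_ms=400, hop_ms=100):
    """Return (window length, hop length, list of window start indices)."""
    window = sample_rate * window_ms // 1000
    hop = sample_rate * hop_ms // 1000
    starts = list(range(0, num_samples - window + 1, hop))
    return window, hop, starts

# Hypothetical example: two seconds of audio at 48 kHz.
window, hop, starts = frame_starts(num_samples=96000, sample_rate=48000)

# Count how many windows include one sample from the middle of the file.
probe = 40000
covering = [s for s in starts if s <= probe < s + window]

print(f"window={window} samples, hop={hop} samples, {len(starts)} windows")
print(f"sample {probe} falls inside {len(covering)} overlapping windows")
```

With a 4:1 ratio of window to hop, every sample away from the file edges contributes to four consecutive windows, so an event gets evaluated at four different offsets within the 400ms span.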

Room measurements use smaller windows in an attempt to eliminate reflections, so that may not be the same reason for a shorter window in Audiolense.

I'll read PEAQ paper in more detail, but on a quick look, it appeared to be doing very similar things to PK Metric (at least their fast/real-time version).

I had one other idea to share. Way back when, I wrote a test algorithm similar to what you're doing, but with the goal of uncovering time-variant processes such as DSP signal-dependent gain, compression, etc. The basic idea was to take any music stimulus you wanted, use a steep notch filter, and then embed a tone in the notch whose amplitude was modulated by the signal envelope so as to minimize disruption to the signal envelope. You'd then run it through the DUT and compare output to stimulus using a steep band-pass filter, looking only at the tone. It uncovered a multitude of interesting signal-dependent non-linearities. I used it to reverse engineer the non-linear signal processing of a third-party device and presented it to the company. They threatened to sue, thinking I'd hacked their firmware. So, I'd say it worked. :) It's different from what you're doing with PK-error in that it's not a perceptual metric, but it's a very powerful addition to the toolbox for seeing a bit more directly what real non-linear processes are occurring in real time with any signal. I presented it to the IEEE and they were going to pursue it, but then I left the audio industry and I don't think it was picked up after I left.

Take a deeper look at DeltaWave. It finds and corrects or measures many non-linear effects (including phase and amplitude errors, level non-linearity, jitter, clock drift, and some others) all without the need for an embedded signal, simply by figuring out the best way to match the recorded file back to the original.
 

DDF

Addicted to Fun and Learning
Joined
Dec 31, 2018
Messages
617
Likes
1,355
Room measurements use smaller windows in an attempt to eliminate reflections, so that may not be the same reason for a shorter window in Audiolense.

Room measurements use smaller windows to eliminate reflections but the better ones used something like fixed cycle count windows to try and be more predictive of the resultant in-room perceived timbre. Give it a look.

Take a deeper look at DeltaWave

I definitely will. Investigating its capabilities will be a major project unto itself. Looking forward to it.

The embedded tone technique has distinct advantages over correlation-based comparative techniques; I think you'd be surprised. But it's beyond the scope of what you're doing, so I'm just sharing another idea to consider or bin.
 