
Exploring a Neural Network-based Audio Upsampler: Overcoming the Ringing/Mirroring Trade-off

Nothing wrong.
Below is 7k, 11k, 15k, 19k
But that is not what the summed sine waves look like, so there must be something wrong somewhere. What are you using to run the simulation?

Can you do a run with just the fundamental and third harmonic. It should look like this:

If you can also show a plot with the two sine waves shown separately one above the other, that might give us a clue also.
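
For anyone who wants to reproduce that check, here is a minimal sketch of the two-component sum. The 1 kHz fundamental and the 384 kHz simulation rate are assumptions picked for illustration, not the values used in the run above.

```python
import numpy as np

def two_harmonic_square(f0=1000.0, fs=384000.0, cycles=2):
    """Return the time axis, fundamental, third harmonic and their sum.

    f0 and fs are assumed values for illustration only.
    """
    n = int(round(fs / f0 * cycles))
    t = np.arange(n) / fs
    fundamental = np.sin(2 * np.pi * f0 * t)
    third = np.sin(2 * np.pi * 3 * f0 * t) / 3.0   # square-wave weighting 1/k
    return t, fundamental, third, fundamental + third

t, h1, h3, total = two_harmonic_square()
print("peak of the sum:", total.max())
print("value mid-plateau (quarter cycle):", total[len(t) // 8])
```

The sum should show a flattened top with a dip in the middle of each half cycle (1 - 1/3 at the quarter-cycle point). If a simulation shows something else, a units slip such as 5 instead of 5k is a likely suspect.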

Screenshot 2026-02-28 at 13.45.33.png
 
I did have an error: 5 instead of 5k. Sorry.
No need to apologise. We all get it wrong at times - as I ably demonstrated myself upthread.
 
The pre-ringing at the transient is the major issue. In post 1 the OP shows Bessel IIR and min-phase FIR filters that don't have pre-ringing. These are multiplying coefficients that will multiply with the transient of the instrument notes and generate a beat before we hear the note played. The post-ringing is less problematic, as the beat will be added to the harmonics of the note. If you listen carefully to the sound of a piano on digital recordings to appreciate the quality or guess the brand, you will notice that they all sound metallic. The musician who played it can clearly tell that it is not the instrument he played; we listeners cannot, unless one is a piano expert. This is why, once it gets stained at the recording, it doesn't make much difference afterwards what kind of filter you are using. This is why NOS and OS sound the same with modern recordings.
 
The pre-ringing at the transient is the major issue. In post 1 the OP shows Bessel IIR and min-phase FIR filters that don't have pre-ringing. These are multiplying coefficients that will multiply with the transient of the instrument notes and generate a beat before we hear the note played.
How can pre-ringing be problematic for those specific filters if they don't even have it? You talk about "coefficients" again, why? We know the impulse responses in this case, which fully define the filters.

The post-ringing is less problematic, as the beat will be added to the harmonics of the note.
There's usually more than a single note playing at the same time in music. Filter artifacts will be mixed with all sorts of other frequencies present in the audio. There is also no evidence that pre-ringing is problematic or post-ringing may be less so. Can you provide any proof that it is?

If you listen carefully to the sound of a piano on digital recordings to appreciate the quality or guess the brand, you will notice that they all sound metallic. The musician who played it can clearly tell that it is not the instrument he played; we listeners cannot, unless one is a piano expert.
Could you provide any evidence for that claim?

This is why, once it gets stained at the recording, it doesn't make much difference afterwards what kind of filter you are using. This is why NOS and OS sound the same with modern recordings.
With all due respect, this is just voodoo. Also, NOS mode usually does not quite sound the same as OS modes (for 44.1 kHz and maybe 48 kHz material), specifically because of the drooping frequency response, unless you oversample before sending the signal to the DAC in NOS mode.
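
To put a rough number on that droop: the sketch below assumes an ideal zero-order-hold NOS DAC with no analogue compensation. That is an idealisation for illustration, not a measurement of any particular device. The hold has a sin(x)/x magnitude response, which works out to roughly 3 dB down at 20 kHz for 44.1 kHz material.

```python
import numpy as np

def zoh_droop_db(f_hz, fs_hz):
    """Magnitude (dB) of an ideal zero-order hold at f_hz for sample rate fs_hz.

    np.sinc(x) is sin(pi*x)/(pi*x), which is the ZOH response with x = f/fs.
    """
    return 20 * np.log10(np.abs(np.sinc(f_hz / fs_hz)))

for f in (1000.0, 10000.0, 20000.0):
    print(f"{f / 1000:5.1f} kHz: {zoh_droop_db(f, 44100.0):6.2f} dB")
```

Oversampling before the DAC pushes the hold's rolloff far above the audio band, which is why the droop largely disappears in that case.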

Get a high res recording with the full 192 kHz bandwidth, downsample that to 44.1 kHz and do a blind ABX test with those two to prove that they sound any different - if that test succeeds, we can talk. Otherwise, you just repeat common audiophile fantasies.
 
In post 1 the OP shows Bessel IIR and min-phase FIR filters that don't have pre-ringing

As I keep on pointing out, the pre-ringing isn't really ringing: it is the correct shape of a band-limited signal. If you use a minimum phase filter, the phase distortion delays the high-frequency Gibbs effect until after the discontinuity, but this is actually less accurate.

Also - taking into account this is all happening at inaudible frequencies for adults - then it all becomes academic in any case.
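
A quick way to see this without any filter at all is to build the square wave directly from its sine components. The sketch below is an illustration under assumed values (1 kHz fundamental, odd harmonics up to the 19th, 384 kHz simulation rate). No filter coefficients are involved, yet the ripple appears on both sides of the edge and is mirror-symmetric.

```python
import numpy as np

def bandlimited_square(t, f0=1000.0, max_harmonic=19):
    """Fourier-series square wave (+/-1) from odd harmonics up to max_harmonic."""
    s = np.zeros_like(t)
    for k in range(1, max_harmonic + 1, 2):
        s += (4 / np.pi) * np.sin(2 * np.pi * k * f0 * t) / k
    return s

fs = 384000.0                  # assumed simulation rate
n = int(0.4e-3 * fs)           # look 0.4 ms either side of the rising edge at t = 0
t = np.arange(-n, n + 1) / fs
s = bandlimited_square(t)

before, after = s[t < 0], s[t > 0]
print("undershoot before the edge:", before.min())
print("overshoot after the edge:  ", after.max())
print("ripple mirrored about the edge:", np.allclose(s, -s[::-1]))
```

Pushing all of that ripple to after the edge, as a minimum phase filter does, means changing the phase relationship between the components, so the result is no longer the plain sum above.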


you will notice that they all sound metallic.
This is nonsense.



Get a high res recording with the full 192 kHz bandwidth, downsample that to 44.1 kHz and do a blind ABX test with those two to prove that they sound any different - if that test succeeds, we can talk. Otherwise, you just repeat common audiophile fantasies.
This.
 
The pre-ringing at the transient is the major issue. In post 1 the OP shows Bessel IIR and min-phase FIR filters that don't have pre-ringing. These are multiplying coefficients that will multiply with the transient of the instrument notes and generate a beat before we hear the note played. The post-ringing is less problematic, as the beat will be added to the harmonics of the note. If you listen carefully to the sound of a piano on digital recordings to appreciate the quality or guess the brand, you will notice that they all sound metallic. The musician who played it can clearly tell that it is not the instrument he played; we listeners cannot, unless one is a piano expert. This is why, once it gets stained at the recording, it doesn't make much difference afterwards what kind of filter you are using. This is why NOS and OS sound the same with modern recordings.
This is the second time a similar claim has been made, so I just want to say:
1. As a drummer who has recorded at both a relatively expensive studio with a big SSL console and at home using cheapish digital consoles or low-end portable recorders, I have never heard anything like a pre-hit/pre-attack. And overall the sound is much, much more influenced by the mics used, their placement and potential effects like EQ and compression than by the recording device.
2. No pianist I know (mostly jazz/swing) has complained about unnatural sound like this either.

I do not claim to have golden ears (but I want to say that I've been wearing earplugs since I was 16 years old, so they're not ruined by drumming either), and I have found that many musicians much better than me care about sound less than I do, so I cannot claim to be a definite authority on this. But from my experience this is simply not true.
 
Digital or analogue. If I create the closest approximation of a square wave from its component sine waves - up to a maximum of 20kHz. Say this 1kHz square wave, with harmonics up to the 19th:
What you say is correct, but my question is whether it is right to put that ringing into the digital data in order to make music sound more natural. There is a difference between having it filtered by subsequent electrical circuits, transducers, the air, and your ears, and having it fixed as ringing in the digital data.

Delta-sigma DACs must perform upsampling anyway, and my aim is to find a way to do that upsampling digitally that keeps sharp things as sharp as possible and smooth things as smooth as possible.
 
Get a high res recording with the full 192 kHz bandwidth, downsample that to 44.1 kHz and do a blind ABX test with those two to prove that they sound any different - if that test succeeds, we can talk. Otherwise, you just repeat common audiophile fantasies.

Please read this paper.
It says that trained listeners are likely to be able to detect the difference. It also discusses that, rather than detecting high-frequency sounds, they are more likely detecting adverse effects such as ringing and mirror components.

 
What you say is correct, but my question is whether it is right to put that ringing into the digital data in order to make music sound more natural. There is a difference between having it filtered by subsequent electrical circuits, transducers, the air, and your ears, and having it fixed as ringing in the digital data.

Delta-sigma DACs must perform upsampling anyway, and my aim is to find a way to do that upsampling digitally that keeps sharp things as sharp as possible and smooth things as smooth as possible.
It is not "put in to" the digital data. It is there in the band limited signal. A part of it. If the ADC didn't include it then it wouldn't be doing a good job of capturing the waveform. It would be less accurate.
 
It is not "put in to" the digital data. It is there in the band limited signal. A part of it. If the ADC didn't include it then it wouldn't be doing a good job of capturing the waveform. It would be less accurate.
So in my NN upsampler I cut off 20 kHz to 22.05 kHz and add a hi-res hallucination during upsampling.
 
So in my NN upsampler I cut off 20 kHz to 22.05 kHz and add a hi-res hallucination during upsampling.
Which the human ear can do nothing with. And in any case is less accurate to the source. You are adding stuff that is not in the recording.

It's inaudible, so it doesn't really matter, but it has no benefit either.
 
Which the human ear can do nothing with. And in any case is less accurate to the source. You are adding stuff that is not in the recording.

It's inaudible, so it doesn't really matter, but it has no benefit either.
Please read this paper.

I'm not saying that the difference is clear enough for everyone to recognize. It's a very small difference. However, there are statistical studies showing that humans can perceive the difference between high-resolution and lower-resolution audio.

The same argument can be made for digital cameras. I think it is quite possible that we perceive hallucinated detail and spurious resolution as slightly better, even if they are mathematically incorrect.

---
(edit)The translation wasn't very good and didn't reflect my intentions, so I ended up changing quite a bit of the text. I'm sorry.
 
Please read this paper.

I'm not saying that the difference is clear enough for everyone to recognize. It's a very small difference. However, there are statistical studies showing that humans can perceive the difference between high-resolution and lower-resolution audio.

The same argument can be made for digital cameras. I think it is quite possible that we perceive hallucinated detail and spurious resolution as slightly better, even if they are mathematically incorrect.

---
(edit)The translation wasn't very good and didn't reflect my intentions, so I ended up changing quite a bit of the text. I'm sorry.

There is almost no-one on this forum whose hearing will reach even close to 20kHz. We start to lose our high frequency hearing even before we leave our teens.

And please read this post discussing that study you link to:
 
There is almost no-one on this forum whose hearing will reach even close to 20kHz. We start to lose our high frequency hearing even before we leave our teens.

And please read this post discussing that study you link to:
Unfortunately, I did not yet find the time to analyze the Reiss paper itself in detail. That post seems to be a good discussion, though. The common problematic point in many of those experiments seems to be bad or at least non-standard dithering used during downsampling. This would align well with the fact that @amirm has shown on multiple occasions that he is able to ABX 16 vs 24 bit files (for example in this video around 18:30). He explains that he simply listens for the noise floor. On best case red book content, that will be down around -96 dB, which is not inaudible under all circumstances - especially when listening at elevated levels and with good sound isolation like it is provided by IEMs or closed back headphones. Without or with improper dithering, the problem would become more apparent even earlier.

Clearly, there is a plausible way to differentiate "high res" from red book quality material which has been demonstrated with very high confidence. In contrast, I have not seen convincing evidence yet that simply changing the sampling rate while keeping the bit depth constant would likely be audible. I hope to find time in the coming weeks to read the paper, but so far, I am sceptical.
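
For reference, the -96 dB figure follows directly from the word length. The sketch below only evaluates the textbook formulas (ratio of full scale to one LSB, and the ideal undithered quantization SNR for a full-scale sine). It does not model dither or any real converter.

```python
import math

def dynamic_range_db(bits):
    """Ratio of full scale to one LSB, in dB: 20*log10(2**bits)."""
    return 20 * math.log10(2 ** bits)

def sine_sqnr_db(bits):
    """Ideal quantization SNR for a full-scale sine (textbook 6.02*N + 1.76)."""
    return 6.02 * bits + 1.76

for bits in (16, 24):
    print(f"{bits} bit: range {dynamic_range_db(bits):.2f} dB, "
          f"sine SQNR {sine_sqnr_db(bits):.2f} dB")
```

The gap of roughly 48 dB between the 16 and 24 bit figures is the kind of difference a noise-floor listening strategy could plausibly exploit at high playback levels, which is a separate matter from sampling rate.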
 
There is almost no-one on this forum whose hearing will reach even close to 20kHz. We start to lose our high frequency hearing even before we leave our teens.
I am not discussing whether people can hear a pure tone or pitch at about 20kHz.
Nor do I think we can reconstruct the hi-res waveform perfectly.
The main theme is the side effects of these artifacts (ringing, imaging), for example transient smearing or IMD.
Again, I also think it is very difficult to tell hi-res from non-hi-res when listening to normal music.
The main theme is "spurious resolution", which lets humans perceive sound just a little more sharply.

And please read this post discussing that study you link to:
It is a good post. Thank you!
It's interesting.
Anyway, it's difficult to tell whether normal music is hi-res or not.
 
Yes and no. Any mathematically sound reconstruction has a much higher time resolution than 22.6 μs (for 44.1 kHz), but it is also just one valid interpretation of many. For a theoretical case with a perfectly bandlimited input signal and without quantization errors, I think there should be no loss of precision at all. With quantization, the calculation you linked is the lower limit. But in practice, it's not that good (input not perfectly bandlimited, etc.) and each reconstruction filter will have various deviations from the unknown ground truth - which are all still valid renditions of it.
Here is a pulse with 2.6 µs shift between channels played through a few DACs:
 
Here is a pulse with 2.6 µs shift between channels played through a few DACs:
Nice. :)

Can I ask what the original pulse was band limited at? (As accurately as I can read from the plots, about 8.8kHz?) I assume from the shape that it used a minimum phase filter?
 
Can I ask what the original pulse was band limited at? (As accurately as I can read from the plots, about 8.8kHz?) I assume from the shape that it used a minimum phase filter?
-6 dB at 10k with 20k transition band. It was SoX's "sinc -M -a140 -t20k -10k". The full command was:
Code:
sox -r384k -n imp.wav synth 1 sq 1 trim 0 1s pad .5 .5 sinc -M -a140 -t20k -10k norm -1 remix 1 1 delay 0 1s trim 0 1

imp2.8x.fft.png


I also had 19k with a 2k transition band with similar results, but I thought the audiophile crowd would appreciate the pulse with less ringing more :-)

On another occasion (actually 2 days ago) I also generated a 1 kHz tone with a 50 ns shift between channels. I played and recorded that, then normalized each channel's level and subtracted one channel from the other to determine what the shift in the recording was:

fft.50ns.png
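
The arithmetic behind that subtraction trick can be sketched as follows. This is a purely digital simulation with an assumed 48 kHz rate and an arbitrary level mismatch, not the actual play/record chain or the analysis used for the plot above. For two equal-level sines offset by tau, the difference has amplitude 2*sin(pi*f*tau) relative to the signal, so the residual level gives tau back.

```python
import numpy as np

def estimate_shift_s(left, right, f0):
    """Estimate the inter-channel delay of a sine from the level of (left - right)."""
    left = left / np.sqrt(np.mean(left ** 2))      # normalise each channel to unit RMS
    right = right / np.sqrt(np.mean(right ** 2))
    ratio = np.sqrt(np.mean((left - right) ** 2))  # residual RMS relative to the signal
    return np.arcsin(ratio / 2) / (np.pi * f0)     # invert |L - R| = 2*sin(pi*f0*tau)

fs, f0, tau = 48000.0, 1000.0, 50e-9               # assumed rate; 1 kHz tone, 50 ns shift
t = np.arange(int(fs)) / fs                        # exactly 1000 full cycles
left = np.sin(2 * np.pi * f0 * t)
right = 0.9 * np.sin(2 * np.pi * f0 * (t - tau))   # arbitrary level mismatch

est = estimate_shift_s(left, right, f0)
print(f"estimated shift: {est * 1e9:.2f} ns")
```

Note that 50 ns is far smaller than the roughly 20.8 µs sample period at 48 kHz, yet it is still recoverable. This method only gives the magnitude of the shift, not which channel leads, and in a real recording noise and distortion in the residual will set the floor.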
 
Thanks for the info.
I also had 19k with a 2k transition band with similar results, but I thought the audiophile crowd would appreciate the pulse with less ringing more :-)
Though that would probably have exposed differences more, since it could be more affected by any slow-rolloff reconstruction filters.
 