
Exploring a Neural Network-based Audio Upsampler: Overcoming the Ringing/Mirroring Trade-off

And the envelope (as you have shown it there) exists in the spectrum of the sound as its own frequency tone. If you FFT that waveform you have given as an example, it will show both the carrier (high frequency) and the baseband signal (the low) in the spectrum.

With the caveat that I don't understand all the mathematical details, I think this is wrong. See example below. Musicians use this when tuning their instrument.

1772184138389.png
 
Dumb question, but would it be possible to offer this as a free, open-source plug-in for foobar, or in VST3 format for use in other media players, so people can enable/disable it and compare?
 
Hmm, isn't the so-called pre-ringing seen in measurements an artifact of the measurement itself? You send a perfect impulse, which is not a valid band-limited signal. You can watch the pre-ringing all day long, but it is not "bad" as marketing would have it; it's just how the filter works.

Properly band-limited signals, i.e. the music you are trying to play, should not provoke this ringing behavior. If "invalid" higher harmonics have leaked into the music, there may be some small ringing at very high frequencies and very low level, but that should not be typical, and it is not reason enough to abandon mathematically correct reconstruction.
 
Hi ASR community,
I would like to share an experimental Neural Network-based audio upsampler I've been working on, aiming to address the traditional mathematical trade-offs in upsampling filters.

Motivation
Traditional upsamplers inherently face a mathematical trade-off:
  • Using a brick-wall filter introduces ringing (time-domain smearing).
  • Suppressing ringing (e.g., using a slow roll-off) results in poor imaging/mirroring in the frequency domain.
Ideally, I wanted to achieve the best of both worlds: preserving the excellent transient response characteristic of NOS (Non-Oversampling) DACs, while eliminating their inherent drawback—the severe IMD caused by ultrasonic mirroring. To achieve this, I explored using a Neural Network to "hallucinate" inaudible ultra-high frequencies. The goal is to add plausible high-frequency content to maintain the original waveform's shape and transients, while actively suppressing unwanted mirroring.

Methodology
  • The audio is split around 20 kHz using an FIR filter.
  • Frequencies below 20 kHz are output exactly as they are.
  • For frequencies above 20 kHz, I trained a U-Net model using 88.2 kHz audio (with >20 kHz content as ground truth) to remove mirroring components.
  • During training, I introduced penalties for energy gaps and ringing.
  • Since the 20–22.05 kHz band (and its alias 22.05–24.1 kHz) typically contains a lot of noise/garbage, the output in this region is actively suppressed.
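To illustrate the mirroring that the steps above are trying to remove (this is not the author's code): a minimal Python sketch, assuming plain 2x zero-stuffing with no filter at all, showing that a 20 kHz tone sampled at 44.1 kHz reappears at 44.1 - 20 = 24.1 kHz. The helper names `zero_stuff_2x`, `tone_level` and `mirror_frequency` are made up for this example.

```python
import math

FS_IN = 44_100          # source sample rate (Hz)
FS_OUT = 2 * FS_IN      # rate after 2x upsampling (Hz)


def zero_stuff_2x(samples):
    """Insert a zero after every sample (2x upsampling with no filter)."""
    out = []
    for s in samples:
        out.extend((s, 0.0))
    return out


def tone_level(samples, freq_hz, fs):
    """Amplitude of the sinusoidal component at freq_hz (single-bin DFT)."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / fs) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / fs) for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n


def mirror_frequency(freq_hz, fs_in=FS_IN):
    """Frequency of the first image of freq_hz after upsampling from fs_in."""
    return fs_in - freq_hz


# A 20 kHz tone lasting exactly 0.1 s, so every frequency measured below
# completes a whole number of cycles and there is no spectral leakage.
n_in = FS_IN // 10
tone = [math.sin(2 * math.pi * 20_000 * i / FS_IN) for i in range(n_in)]
stuffed = zero_stuff_2x(tone)

original = tone_level(stuffed, 20_000, FS_OUT)
image = tone_level(stuffed, mirror_frequency(20_000), FS_OUT)
print(f"20 kHz level: {original:.3f}, 24.1 kHz image level: {image:.3f}")
```

With no filter at all, `original` and `image` come out at the same level; any upsampling filter, conventional or NN-based, is judged by how far it pushes `image` down and what it does to the waveform in the process.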
Measurements & Graphs

I’ve attached several measurement graphs comparing 3 methods: Bessel IIR 2x, Min-Phase FIR 2x (10000 taps, Kaiser β=10.0), and my NN Stage1 (C++ GPU).
  1. Sweep FFT: You can see that mirroring is effectively removed in the NN approach compared to the raw Bessel IIR.
    View attachment 513173
  2. Impulse / Square Wave: The NN produces very minimal ringing—significantly less than the long-tap FIR (10k min-phase).
    View attachment 513175
    View attachment 513174
  3. IMD (Twin-Tone): IMD characteristics are much better suppressed compared to the Bessel IIR.
    View attachment 513176
  4. Group Delay: Looks solid with no major issues in the audible band.
    (Note: In actual music waveforms, after a blank band from 20 to 24.1 kHz, you can observe the hallucinated frequencies, for instance on metallic drum hits.)
    View attachment 513177
Code & Training Data
The training results and the upsampling program (optimized for NVIDIA GPUs) are open source and available here:

https://github.com/michihitoTakami/totton-audio-neural-upsampler

Subjective Comparison
Admittedly, the differences are subtle and require trained ears to notice, but here are my subjective impressions:
  • Vs. Long-tap (10k) Min-Phase FIR: Transients feel noticeably sharper (e.g., metallic "clink" sounds are stronger). The post-echo sensation typical of min-phase filters is largely absent.
  • Vs. Standard Bessel IIR: The sound is much less "cloudy." The muddiness likely caused by IMD is reduced, making pianos and vocals sound cleaner.
Cognitive / ABX Test
I conducted a blind ABX test (N=1) with my 10-year-old son.
  • Source: "Evans on Evans" (Piano Trio).
  • Method: Switched at random 5-second intervals with a 1-second gap in various randomized patterns.
  • Result: He achieved a 100% correct distinction rate comparing the NN vs. FIR, and NN vs. Bessel IIR. (Note: For the FIR comparison, highs were slightly compensated using a High-Shelf IIR EQ). His comments independently matched my subjective reasoning above.
Other Challenges
  • Currently, I'm running this upsampler on a Jetson Orin Nano. It has a processing delay of just over 1 second, but it streams audio almost in real-time. I previously posted about my DIY GPU DSP box here:
    https://www.audiosciencereview.com/...y-gpu-dsp-ddc-box-looking-for-opinions.68354/
  • I am also working on a 4x upsampler, trying to train it on an ideal zero-padded signal instead of a Bessel IIR base, but it hasn't converged well yet.
Conclusion
I would love to hear your thoughts, technical feedback, or any questions you might have. If you have the setup to test the code, please give it a try and let me know how it measures and sounds to you!
An interesting experiment that's made me think differently, which I enjoy.

Although not relevant to your solution, have you taken the content that the Neural Network generates above 44kHz and slowed it down to listen to what it sounds like? If you do this with content natively recorded at 96kHz, it is sometimes obvious how the inaudible ultrasonic sounds "map" to the normally audible sounds. Given the NN is "recreating" the "missing content", what does it sound like?
 
The ringing is not a result of filtering per se. Ringing is a natural "feature" of band-limited signals. And there are no pure impulses, or square waves in nature because *all* transmission mediums are band-limited somewhere.

You can construct a band-limited square wave without any filtering by summing the Fourier series of odd harmonics, each at 1/(harmonic number) amplitude, and stopping at the highest harmonic that fits within the band limit.
There are no acoustic filters in nature that have a steep brick wall. If you try to realize this steep brick wall, a lot of ringing occurs. With a smooth low-pass filter like the ear, ringing is hardly an issue.
So my idea is to use a neural network based on a gentle low-pass filter to effectively remove mirroring.

Yes, your results show that you can suppress the ringing quite a bit using that method. I am just not sure the ringing is a problem and that the increased envelope is problematic, outside of looking ugly. I have seen claims on various forums and websites where people were under the impression that they could hear pre-ringing in linear phase filters, but I am not aware of well designed studies which support those claims.
High frequencies above 20kHz are problematic, but the key to this method is to train the system to produce as natural sounds as possible. Another key point is to apply an energy cap to prevent the system from producing excessively loud sounds in the high frequency range. The goal is to create a neural network that can produce natural high-frequency sounds.


Why is the stopband rejection of the 10k-tap FIR filter so poor? For comparison, here are some plots of the default "ridiculous overkill" filter I currently use in my resampler (shown for 96kHz to 44.1kHz conversion)—
Thanks for pointing that out. I had the window function settings wrong. A Kaiser window with β of about 10 should actually give a stopband of roughly -100 to -110 dB. Since I didn't make any special adjustments to lower it, the noise floor ends up somewhat higher than a linear-phase design's, because the filter is minimum phase.
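As a rough cross-check of the -100 dB figure: a minimal sketch using Kaiser's empirical design formulas, which apply to a linear-phase Kaiser-windowed sinc (a minimum-phase conversion, as used in the comparison filter, will not follow them exactly). The function names are made up for this example, and the 88.2 kHz rate is assumed from 2x upsampling of 44.1 kHz material.

```python
import math


def kaiser_attenuation_db(beta):
    """Stopband attenuation A (dB) from beta, inverting Kaiser's
    empirical relation beta = 0.1102 * (A - 8.7), valid for A > 50 dB."""
    atten = beta / 0.1102 + 8.7
    if atten <= 50:
        raise ValueError("this branch of Kaiser's formula needs A > 50 dB")
    return atten


def kaiser_transition_hz(num_taps, atten_db, fs):
    """Approximate transition width (Hz) for a given tap count, from
    Kaiser's estimate num_taps - 1 = (A - 7.95) / (2.285 * delta_omega)."""
    delta_omega = (atten_db - 7.95) / (2.285 * (num_taps - 1))  # rad/sample
    return delta_omega * fs / (2 * math.pi)


atten = kaiser_attenuation_db(10.0)
width = kaiser_transition_hz(10_000, atten, 88_200)
print(f"beta=10 -> about {atten:.1f} dB stopband attenuation")
print(f"10000 taps at 88.2 kHz -> transition width of about {width:.0f} Hz")
```

So a stopband much shallower than about 100 dB with β=10 does point to a design or measurement slip rather than a limit of the window itself.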
 
Once at recording it is stained, it doesn't matter any more to have it passing ultra transparent reproduction.
That's why we discard the 20kHz to 24.1kHz range. We recognize this frequency range as containing some ringing from mastering and recording. We have included this in the learning settings to suppress it.
 
The envelope of the strike from the start of the sound until peak amplitude takes about 8 ms (equivalent to about 125Hz). The highest frequency of that strike phase is less than 9 kHz. Hardly an infinite bandwidth impulse, or square wave.
I think the fundamental frequency of metallic sounds is around 8 kHz. That's why I recommend you do a Fourier expansion and check whether it contains very high frequency components (20-22.05 kHz). If it does, ringing will theoretically occur.
 
With the caveat that I don't understand all the mathematical details, I think this is wrong. See example below. Musicians use this when tuning their instrument.

View attachment 514102


You are correct, and I was wrong in my statement above. The envelope is not in the spectrum. What we are seeing as the envelope is the beat frequency between two near frequencies - and what we hear is the level change in the sound. This is how a guitarist uses it for tuning. Once the frequencies between the two strings match, he can hear it because there is no longer any beating.

So in the example above - yes, we can hear a beat frequency - when it is low enough to perceive as a volume beat. I am not sure what we hear if the beat frequency is higher though.

But what is also clear - is that if the two beating frequencies are ultrasonic - then we will not hear anything - because the ear will not detect either of them, and there is then nothing to beat. The only way we might hear something, is if the two frequencies inter-modulate in the ear canal (due to non-linearities), resulting in difference frequencies in the audible range being generated that way. Though I have no idea if this actually happens.
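A minimal Python sketch of the point about beating, under assumptions chosen only for illustration: two audible tones at 440 Hz and 444 Hz, an 8 kHz sample rate, and a mild second-order nonlinearity `K2` standing in for whatever the ear, amplifier or driver might do. The summed signal has no spectral component at the 4 Hz beat rate; one appears only after the nonlinearity.

```python
import math

FS = 8_000   # sample rate (Hz), chosen only for this illustration
N = FS       # one second of samples, so each tone fits a whole number of cycles


def tone_level(samples, freq_hz, fs=FS):
    """Amplitude of the sinusoidal component at freq_hz (single-bin DFT)."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / fs) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / fs) for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n


# Two tones 4 Hz apart: the waveform's envelope beats at 4 Hz.
two_tones = [
    math.sin(2 * math.pi * 440 * i / FS) + math.sin(2 * math.pi * 444 * i / FS)
    for i in range(N)
]

# Linear case: energy only at 440 Hz and 444 Hz, nothing at the beat rate.
beat_linear = tone_level(two_tones, 4)
level_440 = tone_level(two_tones, 440)
level_444 = tone_level(two_tones, 444)

# Assumed second-order nonlinearity y = x + K2 * x^2 (hypothetical value).
K2 = 0.1
distorted = [x + K2 * x * x for x in two_tones]
beat_distorted = tone_level(distorted, 4)

print(f"4 Hz component, linear:    {beat_linear:.6f}")
print(f"4 Hz component, nonlinear: {beat_distorted:.6f}")
```

The same arithmetic holds if both tones are ultrasonic: the difference tone only exists once something nonlinear in the chain creates it.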
 
Once at recording it is stained, it doesn't matter any more to have it passing ultra transparent reproduction.
That's why we discard the 20kHz to 24.1kHz range. We recognize this frequency range as containing some ringing from mastering and recording. We have included this in the learning settings to suppress it.
Properly band-limited signals, i.e. the music you are trying to play, should not provoke this ringing behavior. If "invalid" higher harmonics have leaked into the music, there may be some small ringing at very high frequencies and very low level, but that should not be typical, and it is not reason enough to abandon mathematically correct reconstruction.
If mathematical accuracy were all that mattered, everyone would probably be satisfied with delta-sigma or long-tap FIR. But not everyone is. It's true that some people say NOS R2R sounds good. So I wonder whether we can create a high-precision version of it, one that suppresses the mirroring, by processing the signal digitally.
Although not relevant to your solution, have you taken the content that the Neural Network generates above 44kHz and slowed it down to listen to what it sounds like? If you do this with content natively recorded at 96kHz, it is sometimes obvious how the inaudible ultrasonic sounds "map" to the normally audible sounds. Given the NN is "recreating" the "missing content", what does it sound like?
I don't listen to that. I've given up on reproducing the tone and pitch, and I'm focusing on the good waveform envelope and minimal mirror components.
 
But what is also clear - is that if the two beating frequencies are ultrasonic - then we will not hear anything - because the ear will not detect either of them, and there is then nothing to beat. The only way we might hear something, is if the two frequencies inter-modulate in the ear canal (due to non-linearities), resulting in difference frequencies in the audible range being generated that way. Though I have no idea if this actually happens.
Nonlinear components exist not only in the ear, but also in electrical circuits, diaphragms, and the air.

In any case, it is really difficult to distinguish between high-resolution and non-high-resolution audio sources, and it is true that we cannot tell the difference unless we train and listen carefully to sounds that contain a lot of high-frequency components, such as metallic sounds.
 
I think the fundamental frequency of metallic sounds is around 8 kHz. That's why I recommend you do a Fourier expansion and check whether it contains very high frequency components (20-22.05 kHz). If it does, ringing will theoretically occur.

Oh, yes, it will have harmonics going up to 20kHz, but that doesn't make it an impulse, or even anything close to it, any more than a piccolo note with its harmonics is an impulse.

And what you are calling ringing is just a feature of a band-limited signal: essentially just the frequencies that are left (and that were in the signal anyway) after out-of-band signals are removed. It will also be present in the signals leaving the ear (heading brainwards) if the signal reaching your ear has ultrasonic content, since, as I state above, the ear is a biomechanical band-pass filter, and a much messier one than your typical reconstruction filter.
 
Nonlinear components exist not only in the ear, but also in electrical circuits, diaphragms, and the air.
None of which means we can detect the envelope of ultrasonic signals in the ear - without one of those mechanisms creating an audible band tone.
 
That's why we discard the 20kHz to 24.1kHz range. We recognize this frequency range as containing some ringing from mastering and recording. We have included this in the learning settings to suppress it.

If mathematical accuracy were all that mattered, everyone would probably be satisfied with delta-sigma or long-tap FIR. But not everyone is. It's true that some people say NOS R2R sounds good. So I wonder whether we can create a high-precision version of it, one that suppresses the mirroring, by processing the signal digitally.

I don't listen to that. I've given up on reproducing the tone and pitch, and I'm focusing on the good waveform envelope and minimal mirror components.
I do think everyone would be satisfied with current delta-sigma if tested blind, regardless of what they think beforehand :)

I do find your method very interesting, but surely what the neural network does must eventually result in an actual filter or set of filter parameters?
 
There are no acoustic filters in nature that have a steep brick wall.
The bandpass filters in the human cochlea are not far off what I'd call "brick wall" toward the high end: something approaching -100dB only half an octave above the center frequency. Not as steep as digital filters can be, but pretty darn steep nonetheless. And yes, they do ring quite a bit.
 
The bandpass filters in the human cochlea are not far off what I'd call "brick wall" toward the high end: something approaching -100dB only half an octave above the center frequency. Not as steep as digital filters can be, but pretty darn steep nonetheless. And yes, they do ring quite a bit.

Half an octave above 20 kHz is about 28 kHz (20 kHz × √2).
A brick-wall windowed filter, on the other hand, has to do all of its cutting off between 20 kHz and 22.05 kHz.
 
Oh, yes, it will have harmonics going up to 20kHz, but that doesn't make it an impulse, or even anything close to it, any more than a piccolo note with its harmonics is an impulse.

And what you are calling ringing is just a feature of a band-limited signal: essentially just the frequencies that are left (and that were in the signal anyway) after out-of-band signals are removed. It will also be present in the signals leaving the ear (heading brainwards) if the signal reaching your ear has ultrasonic content, since, as I state above, the ear is a biomechanical band-pass filter, and a much messier one than your typical reconstruction filter.
That is just a feature of a band-limited “digital” signal.
We are discussing a digital stepped signal.

When high frequencies are upsampled using a brick wall window function, the usable order of the Fourier series is small and the result is effectively an impulse or square wave, so if you try to connect them smoothly, ringing will occur.
 
That is just a feature of a band-limited “digital” signal.
Digital or analogue. If I create the closest approximation of a square wave from its component sine waves - up to a maximum of 20kHz. Say this 1kHz square wave, with harmonics up to the 19th:

Screenshot 2026-02-28 at 08.36.01.png


I can create it digitally, then convert it to analogue, filtering it to eliminate imaging.

Or I could create it with 19 accurate analogue sine generators with the outputs summed into an op amp summing circuit. Absolutely no filtering necessary, not a digit in sight.

The result is identical. Identical shape, identical Gibbs phenomena (what you are calling ringing).


What you are calling ringing, I am simply calling the natural shape of a band-limited waveform that results when out-of-band frequencies are removed. For example: The same square wave but now only band-limited to 50kHz, it looks like this.

Screenshot 2026-02-28 at 08.42.58.png


Let’s say I created this in the analogue fashion. Now I turn off the top 21 sine generators (only odd harmonics are included), and I get this:

Screenshot 2026-02-28 at 08.49.22.png


7kHz band-limited. No filtering. Just what remains when the higher frequencies that are sharpening the edges are no longer there.

As I say - the natural and correct form of a band-limited signal. Whether that is analogue or digital.
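The construction described above can be sketched in a few lines of Python. This is only an illustration, not the tool used for the screenshots: it sums odd harmonics of a 1 kHz fundamental at 1/n amplitude up to a chosen band limit (no filter anywhere) and reports how many "sine generators" are involved and how high the waveform peaks. The function names and the 4/π scaling (which makes the ideal square wave flat at ±1) are choices made for this example.

```python
import math

F0 = 1_000  # fundamental of the square wave (Hz)


def odd_harmonics(band_limit_hz, f0=F0):
    """Odd harmonic numbers n with n * f0 inside the band limit."""
    return list(range(1, int(band_limit_hz // f0) + 1, 2))


def band_limited_square(t, band_limit_hz, f0=F0):
    """Fourier-series square wave at time t: (4/pi) * sum of sin(2*pi*n*f0*t) / n
    over the odd harmonics that fit in the band limit."""
    total = sum(
        math.sin(2 * math.pi * n * f0 * t) / n
        for n in odd_harmonics(band_limit_hz, f0)
    )
    return 4 / math.pi * total


def peak_value(band_limit_hz, points_per_period=2000, f0=F0):
    """Largest value over one period, sampled on a fine time grid."""
    period = 1 / f0
    return max(
        band_limited_square(k * period / points_per_period, band_limit_hz, f0)
        for k in range(points_per_period)
    )


for limit in (7_000, 20_000, 50_000):
    count = len(odd_harmonics(limit))
    print(f"{limit} Hz limit: {count} generators, peak {peak_value(limit):.3f}")
```

The generator counts come out as 4, 10 and 25, which matches turning off 21 generators to go from the 50 kHz version to the 7 kHz one. The peak stays at roughly 1.18 times the flat-top level in all three cases: widening the band narrows the overshoot in time but does not shrink it.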



so if you try to connect them smoothly, ringing will occur.

Or - the Gibbs phenomenon will be revealed, as the out-of-band frequencies are removed.
 
I simulated a 1 kHz, 1 V sine with odd harmonics up to 19 kHz added at 1/n amplitude. This is what the scope shows.
 

Attachments

  • generator.JPG
  • 1khz square 19h.JPG
I simulated a 1 kHz, 1 V sine with odd harmonics up to 19 kHz added at 1/n amplitude. This is what the scope shows.
Something is wrong with your simulation or scope. I would check that you don't have any filters (or other bandwidth issues) starting to have an effect below 18kHz.

Or, if this is a digital simulation, make sure you have a correct anti-imaging filter in place.

If you want a mathematical simulation, this is what I used:

 
Nothing wrong.
Below are 7 kHz, 11 kHz, 15 kHz and 19 kHz.
 

Attachments

  • mutiple.JPG