Hi ASR community,
I would like to share an experimental Neural Network-based audio upsampler I've been working on, aiming to address the traditional mathematical trade-offs in upsampling filters.
Motivation
Traditional upsamplers inherently face a mathematical trade-off:
- Using a brick-wall filter introduces ringing (time-domain smearing).
- Suppressing ringing (e.g., using a slow roll-off) results in poor imaging/mirroring in the frequency domain.
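To put rough numbers on this trade-off, here is a minimal sketch, assuming NumPy and two illustrative Kaiser-windowed sinc low-pass filters running at the 88.2 kHz output rate. The tap counts (2001 and 15) are arbitrary values picked for illustration, not the filters measured in this post. For each filter it reports how well the 24.1 kHz image of a 20 kHz tone is rejected and how long the impulse response stays within 60 dB of its peak.

```python
import numpy as np

FS_OUT = 88_200  # output rate after 2x upsampling from 44.1 kHz


def lowpass_fir(num_taps, cutoff, beta):
    """Kaiser-windowed sinc low-pass; cutoff is a fraction of Nyquist (0..1)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = cutoff * np.sinc(cutoff * n) * np.kaiser(num_taps, beta)
    return h / h.sum()  # unity gain at DC


def gain_db(h, freq_hz, fs):
    """Magnitude response of h at a single frequency, in dB."""
    n = np.arange(len(h))
    response = np.sum(h * np.exp(-2j * np.pi * freq_hz * n / fs))
    return 20 * np.log10(abs(response) + 1e-12)


def ringing_span_ms(h, fs, floor_db=-60.0):
    """Time span over which |h| stays above floor_db relative to its peak."""
    threshold = np.abs(h).max() * 10 ** (floor_db / 20)
    above = np.nonzero(np.abs(h) > threshold)[0]
    return 1000 * (above[-1] - above[0]) / fs


# Illustrative filters (assumed values), both cutting off at 22.05 kHz.
sharp = lowpass_fir(2001, 0.5, 10.0)   # steep, close to a brick wall
gentle = lowpass_fir(15, 0.5, 10.0)    # slow roll-off

for name, h in (("sharp", sharp), ("gentle", gentle)):
    print(f"{name:6s} image rejection at 24.1 kHz: {gain_db(h, 24_100, FS_OUT):7.1f} dB, "
          f"ringing span: {ringing_span_ms(h, FS_OUT):.2f} ms")
```

The steep filter should show deep image rejection but a much longer ringing span, while the short filter rings briefly and lets most of the image through.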
Methodology
- The audio is split around 20 kHz using an FIR filter.
- Frequencies below 20 kHz are output exactly as they are.
- For frequencies above 20 kHz, I trained a U-Net model using 88.2 kHz audio (with >20 kHz content as ground truth) to remove mirroring components.
- During training, I introduced penalties for energy gaps and ringing.
- Since the 20–22.05 kHz band (and its alias 22.05–24.1 kHz) typically contains a lot of noise/garbage, the output in this region is actively suppressed.
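The band split described above can be sketched as follows. This is only an illustration under stated assumptions: it uses NumPy, a complementary FIR split (high band = delayed input minus low band), an assumed tap count of 1023, and a hypothetical placeholder `enhance_high_band` standing in for the trained U-Net. The actual filter design and model interface in the repository may differ.

```python
import numpy as np

FS = 88_200        # sample rate after 2x upsampling (Hz)
SPLIT_HZ = 20_000  # crossover frequency stated in the post
NUM_TAPS = 1023    # assumption: odd length gives an integer group delay


def lowpass_fir(num_taps, cutoff_hz, fs, beta=10.0):
    """Kaiser-windowed sinc low-pass filter with unity gain at DC."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    fc = cutoff_hz / (fs / 2)  # cutoff as a fraction of Nyquist
    h = fc * np.sinc(fc * n) * np.kaiser(num_taps, beta)
    return h / h.sum()


def split_bands(x, h):
    """Return (low, high) so that low + high is x delayed by (len(h) - 1) / 2 samples."""
    delay = (len(h) - 1) // 2
    low = np.convolve(x, h)               # full convolution, len(x) + len(h) - 1
    delayed = np.zeros_like(low)
    delayed[delay:delay + len(x)] = x     # input aligned with the filter delay
    high = delayed - low                  # complementary high band
    return low, high


def enhance_high_band(high):
    """Hypothetical placeholder for the trained model; returns the band unchanged."""
    return high


# Test signal: a 1 kHz tone (audible) plus a 30 kHz tone (ultrasonic).
t = np.arange(FS // 10) / FS
x = np.sin(2 * np.pi * 1_000 * t) + 0.5 * np.sin(2 * np.pi * 30_000 * t)

h = lowpass_fir(NUM_TAPS, SPLIT_HZ, FS)
low, high = split_bands(x, h)
y = low + enhance_high_band(high)  # the low band passes through untouched
```

With the placeholder in place, `y` is simply the delayed input, which makes it easy to confirm that the split itself does not alter content below 20 kHz before a real model is swapped in.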
I’ve attached several measurement graphs comparing 3 methods: Bessel IIR 2x, Min-Phase FIR 2x (10000 taps, Kaiser $\beta$=10.0), and my NN Stage1 (C++ GPU).
- Sweep FFT: You can see that mirroring is effectively removed in the NN approach compared to the raw Bessel IIR.
- Impulse / Square Wave: The NN produces very minimal ringing—significantly less than the long-tap FIR (10k min-phase).
- IMD (Twin-Tone): IMD products are suppressed much better than with the Bessel IIR.
- Group Delay: Looks solid with no major issues in the audible band.
(Note: In actual music, the spectrum shows a blank band from 20 to 24.1 kHz, above which the hallucinated (model-generated) frequencies appear, for instance on metallic drum hits.)
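If you want to check the band structure on your own captures, here is a minimal sketch, assuming NumPy and a mono signal at 88.2 kHz. The helper `band_level_db` and the synthetic test signal are illustrative only and are not part of the repository. It reports the energy below 20 kHz, in the 20 to 24.1 kHz band, and above 24.1 kHz, relative to the total.

```python
import numpy as np


def band_level_db(x, fs, lo_hz, hi_hz):
    """Energy of x between lo_hz and hi_hz, in dB relative to the total energy."""
    window = np.hanning(len(x))                    # reduce spectral leakage
    power = np.abs(np.fft.rfft(x * window)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    band = (freqs >= lo_hz) & (freqs < hi_hz)
    return 10 * np.log10(power[band].sum() / power.sum() + 1e-20)


fs = 88_200
t = np.arange(fs) / fs  # one second of audio

# Synthetic stand-in for an upsampler output: audible tone plus weak ultrasonic content.
x = np.sin(2 * np.pi * 1_000 * t) + 0.1 * np.sin(2 * np.pi * 30_000 * t)

audible = band_level_db(x, fs, 0, 20_000)
guard = band_level_db(x, fs, 20_000, 24_100)
ultrasonic = band_level_db(x, fs, 24_100, fs / 2)

print(f"below 20 kHz:    {audible:8.1f} dB")
print(f"20 to 24.1 kHz:  {guard:8.1f} dB")
print(f"above 24.1 kHz:  {ultrasonic:8.1f} dB")
```

On a real recording, replace `x` with the captured output; a suppressed guard band should read far below the other two bands.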
The training results and the upsampling program (optimized for Nvidia GPUs) are open-source and available here:
https://github.com/michihitoTakami/totton-audio-neural-upsampler
Subjective Comparison
Admittedly, the differences are subtle and require trained ears to notice, but here are my subjective impressions:
- Vs. Long-tap (10k) Min-Phase FIR: Transients feel noticeably sharper (e.g., metallic "clink" sounds are stronger). The post-echo sensation typical of min-phase filters is largely absent.
- Vs. Standard Bessel IIR: The sound is much less "cloudy." The muddiness likely caused by IMD is reduced, making pianos and vocals sound cleaner.
I conducted a blind ABX test (N=1) with my 10-year-old son.
- Source: "Evans on Evans" (Piano Trio).
- Method: Switched at random 5-second intervals with a 1-second gap in various randomized patterns.
- Result: He achieved a 100% correct distinction rate comparing the NN vs. FIR, and NN vs. Bessel IIR. (Note: For the FIR comparison, highs were slightly compensated using a High-Shelf IIR EQ). His comments independently matched my subjective reasoning above.
- Currently, I'm running this upsampler on a Jetson Orin Nano. It has a processing delay of just over 1 second, but it streams audio almost in real-time. I previously posted about my DIY GPU DSP box here: https://www.audiosciencereview.com/...y-gpu-dsp-ddc-box-looking-for-opinions.68354/
- I am also working on a 4x upsampler, trying to train it on an ideal zero-padded signal instead of a Bessel IIR base, but it hasn't converged well yet.
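For anyone who wants to repeat the ABX comparison described above, here is a small sketch of the usual one-sided binomial check, using only the Python standard library. The trial counts below are hypothetical examples, since the number of trials in my test is not listed here.

```python
from math import comb


def abx_p_value(correct, trials):
    """Probability of scoring at least `correct` out of `trials` by pure guessing."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    # Sum the upper tail of a Binomial(trials, 0.5) distribution.
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials


# Hypothetical perfect scores at different trial counts.
for trials in (5, 10, 16):
    print(f"{trials}/{trials} correct: p = {abx_p_value(trials, trials):.6f}")
```

A perfect score carries more weight the more trials it covers, so it helps to record the trial count alongside the result.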
I would love to hear your thoughts, technical feedback, or any questions you might have. If you have the setup to test the code, please give it a try and let me know how it measures and sounds to you!