
Exploring a Neural Network-based Audio Upsampler: Overcoming the Ringing/Mirroring Trade-off

michihito

Hi ASR community,
I would like to share an experimental Neural Network-based audio upsampler I've been working on, aiming to address the traditional mathematical trade-offs in upsampling filters.

Motivation
Traditional upsamplers inherently face a mathematical trade-off:
  • Using a brick-wall filter introduces ringing (time-domain smearing).
  • Suppressing ringing (e.g., using a slow roll-off) results in poor imaging/mirroring in the frequency domain.
Ideally, I wanted to achieve the best of both worlds: preserving the excellent transient response characteristic of NOS (Non-Oversampling) DACs, while eliminating their inherent drawback—the severe IMD caused by ultrasonic mirroring. To achieve this, I explored using a Neural Network to "hallucinate" inaudible ultra-high frequencies. The goal is to add plausible high-frequency content to maintain the original waveform's shape and transients, while actively suppressing unwanted mirroring.
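The mirroring described above can be reproduced numerically. The following is a minimal sketch, not the project's code: it assumes 2x upsampling by plain zero-stuffing with no filter at all, and the helper names (`zero_stuff_2x`, `magnitude_at`) and the 18 kHz test tone are illustrative choices. It shows that an unfiltered 18 kHz tone reappears with equal strength at 44.1 − 18 = 26.1 kHz.

```python
import cmath
import math

FS_IN = 44_100    # original sample rate (Hz)
FS_OUT = 88_200   # 2x rate (Hz)
N_IN = 441        # 10 ms of input, so every multiple of 100 Hz is an exact DFT bin


def zero_stuff_2x(x):
    """Insert one zero after every sample (2x upsampling with no filter)."""
    y = []
    for s in x:
        y.extend((s, 0.0))
    return y


def magnitude_at(signal, freq_hz, fs_hz):
    """Magnitude of the DFT of `signal` evaluated at a single frequency."""
    w = 2 * math.pi * freq_hz / fs_hz
    return abs(sum(s * cmath.exp(-1j * w * n) for n, s in enumerate(signal)))


tone_hz = 18_000
x = [math.sin(2 * math.pi * tone_hz * n / FS_IN) for n in range(N_IN)]
y = zero_stuff_2x(x)

original = magnitude_at(y, tone_hz, FS_OUT)
image = magnitude_at(y, FS_IN - tone_hz, FS_OUT)   # mirror image at 26.1 kHz
elsewhere = magnitude_at(y, 10_000, FS_OUT)        # a frequency with no content

print(f"18.0 kHz: {original:.1f}  26.1 kHz: {image:.1f}  10.0 kHz: {elsewhere:.3f}")
```

Any reconstruction filter then has to decide how much of the 26.1 kHz component to remove, which is where the ringing versus imaging trade-off comes from.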

Methodology
  • The audio is split around 20 kHz using an FIR filter.
  • Frequencies below 20 kHz are output exactly as they are.
  • For frequencies above 20 kHz, I trained a U-Net model using 88.2 kHz audio (with >20 kHz content as ground truth) to remove mirroring components.
  • During training, I introduced penalties for energy gaps and ringing.
  • Since the 20–22.05 kHz band (and its alias 22.05–24.1 kHz) typically contains a lot of noise/garbage, the output in this region is actively suppressed.
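The band split in the first two steps can be sketched as a complementary FIR pair. This is only an illustration under assumptions: the 101-tap Hann-windowed sinc, the function names, and the test signal are hypothetical and are not taken from the repository. The point it shows is that when the high-pass branch is defined as "delayed input minus low-pass output", the two branches sum back to the input exactly, so the band below the split can pass through unchanged while only the upper branch is processed.

```python
import math


def lowpass_taps(num_taps, cutoff_hz, fs_hz):
    """Hann-windowed sinc low-pass, normalised to unity gain at DC."""
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs_hz  # cutoff in cycles per sample
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * hann)
    total = sum(taps)
    return [t / total for t in taps]


def complementary_highpass(lp):
    """delta[n - mid] - lp[n], so that low + high equals a pure delay."""
    hp = [-t for t in lp]
    hp[(len(lp) - 1) // 2] += 1.0
    return hp


def convolve(x, h):
    """Direct-form full convolution."""
    out = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            out[i + j] += xi * hj
    return out


FS = 88_200
lp = lowpass_taps(101, 20_000, FS)   # tap count is an arbitrary choice
hp = complementary_highpass(lp)

# Test signal: 1 kHz (below the split) plus 30 kHz (above the split).
x = [math.sin(2 * math.pi * 1_000 * n / FS) + 0.5 * math.sin(2 * math.pi * 30_000 * n / FS)
     for n in range(400)]
low, high = convolve(x, lp), convolve(x, hp)

delay = (len(lp) - 1) // 2
worst = max(abs(low[n + delay] + high[n + delay] - x[n]) for n in range(len(x)))
print(f"max recombination error: {worst:.2e}")
```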
Measurements & Graphs

I’ve attached several measurement graphs comparing 3 methods: Bessel IIR 2x, Min-Phase FIR 2x (10000 taps, Kaiser $\beta$=10.0), and my NN Stage1 (C++ GPU).
  1. Sweep FFT: You can see that mirroring is effectively removed in the NN approach compared to the raw Bessel IIR.
    compare_3method_sweep_distill.png
  2. Impulse / Square Wave: The NN produces very minimal ringing—significantly less than the long-tap FIR (10k min-phase).
    impulse_response_3way.png
    distill_500hz_square.png
  3. IMD (Twin-Tone): IMD characteristics are much better suppressed compared to the Bessel IIR.
    imd_18k_19k_3way.png
  4. Group Delay: Looks solid with no major issues in the audible band.
    (Note: In actual music waveforms, after a blank band from 20 to 24.1 kHz, you can observe the hallucinated frequencies, for instance on metallic drum hits.)
    group_delay_3way.png
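To make the twin-tone IMD measurement (item 3) easier to interpret, the arithmetic behind it can be sketched as follows. This assumes only second-order difference products of the form |f1 − f2| and says nothing about their level, which depends on the hardware; the function names are illustrative. With 18 kHz and 19 kHz tones, unsuppressed images at 26.1 kHz and 25.1 kHz add several difference products inside the audible band.

```python
from itertools import combinations

FS_IN = 44_100            # original sample rate (Hz)
AUDIBLE_LIMIT_HZ = 20_000


def mirror_images(tones_hz, fs_hz=FS_IN):
    """First images of each tone around the original sample rate: fs - f."""
    return [fs_hz - f for f in tones_hz]


def audible_difference_products(components_hz, limit_hz=AUDIBLE_LIMIT_HZ):
    """Second-order difference products |f1 - f2| that fall below limit_hz."""
    products = {abs(a - b) for a, b in combinations(components_hz, 2)}
    return sorted(p for p in products if 0 < p < limit_hz)


tones = [18_000, 19_000]
images = mirror_images(tones)  # [26100, 25100]

print(audible_difference_products(tones))           # [1000]
print(audible_difference_products(tones + images))  # [1000, 6100, 7100, 8100]
```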
Code & Training Data
The training results and the upsampling program (optimized for NVIDIA GPUs) are open-source and available here:

https://github.com/michihitoTakami/totton-audio-neural-upsampler

Subjective Comparison
Admittedly, the differences are subtle and require trained ears to notice, but here are my subjective impressions:
  • Vs. Long-tap (10k) Min-Phase FIR: Transients feel noticeably sharper (e.g., metallic "clink" sounds are stronger). The post-echo sensation typical of min-phase filters is largely absent.
  • Vs. Standard Bessel IIR: The sound is much less "cloudy." The muddiness likely caused by IMD is reduced, making pianos and vocals sound cleaner.
Cognitive / ABX Test
I conducted a blind ABX test (N=1) with my 10-year-old son.
  • Source: "Evans on Evans" (Piano Trio).
  • Method: Switched at random 5-second intervals with a 1-second gap in various randomized patterns.
  • Result: He achieved a 100% correct distinction rate comparing the NN vs. FIR, and NN vs. Bessel IIR. (Note: For the FIR comparison, highs were slightly compensated using a High-Shelf IIR EQ). His comments independently matched my subjective reasoning above.
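How much weight a perfect ABX score carries depends on the number of trials, which is not stated above. As a rough guide, the one-sided binomial probability of reaching a given score by guessing can be computed as below; the trial counts in the loop are hypothetical and the function name is illustrative.

```python
from math import comb


def p_value_at_least(correct, trials, p_guess=0.5):
    """Probability of scoring `correct` or more out of `trials` by pure guessing."""
    return sum(
        comb(trials, k) * p_guess**k * (1 - p_guess) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical trial counts; the test above does not state how many were run.
for trials in (5, 10, 16):
    print(f"{trials}/{trials} correct: p = {p_value_at_least(trials, trials):.6f}")
```

A perfect score over 5 trials still has about a 3% chance of being a lucky guess, while 10 or more trials push that below 0.1%.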
Other Challenges
  • Currently, I'm running this upsampler on a Jetson Orin Nano. It has a processing delay of just over 1 second, but it streams audio almost in real-time. I previously posted about my DIY GPU DSP box here:
    https://www.audiosciencereview.com/...y-gpu-dsp-ddc-box-looking-for-opinions.68354/
  • I am also working on a 4x upsampler, trying to train it on an ideal zero-padded signal instead of a Bessel IIR base, but it hasn't converged well yet.
Conclusion
I would love to hear your thoughts, technical feedback, or any questions you might have. If you have the setup to test the code, please give it a try and let me know how it measures and sounds to you!
 
This looks like a fun experiment. The resulting impulse response looks kinda like short delay slow filters on some DACs. It's mixed phase. Kind of odd that you lose 20 dB in magnitude up to 20 kHz. That seems like something to focus on for the next iteration.

EDIT: It also looks like the filter suppresses some images while enhancing others. Did you investigate that further?

Concerning the "time domain smearing" of regular fast filters: That's not really a thing. You do not know how the signal looked before downsampling and a typical linear phase reconstruction filter is a good (maybe the best?) approximation of how it could have looked. See this post for some examples.

Concerning NOS and transient response: Yes, it's steep. Because it contains lots of ultrasonic crap. If you use regular reconstruction filters to guesstimate that transient, the highest frequency component added will be around Nyquist. For CD audio, that's 22.05 kHz. Since we can't hear anything above that, it's irrelevant how much steeper any other filter or NOS mode itself could "reconstruct" that transient: Our ears will not be able to perceive the faster rise time, because they are not sensitive to the higher frequency components in it. The tiny hairs in our inner ear simply can't be moved fast enough to register that faster rise time.
 
Thank you for taking the time to look into this and for the constructive feedback! You raised some excellent points regarding standard digital audio theory. I'd like to clarify the design philosophy and why a Neural Network (non-linear) approach behaves differently from traditional LTI (Linear Time-Invariant) filters.

1. Regarding the 20dB magnitude drop at 20 kHz
This is partially a measurement (FFT algorithm) artifact. But because the base process utilizes a slow roll-off filter to eliminate ringing natively, some high-frequency attenuation (around -6 dB at 20 kHz, depending on the base FIR) is mathematically inevitable. I acknowledge this is a trade-off and an area for further refinement. I am looking into applying an ultra-short tap FIR convolution to gently EQ and compensate for the attenuated response in that region.

2. Selective suppression and enhancement of images
This is actually the core feature of the AI approach, rather than a bug. Traditional LTI filters apply the same mathematical rules regardless of the signal. The NN, however, is content-aware.
For sharp transients (like square waves), it intentionally generates/allows specific high-frequency components (which look like imaging) to reproduce the fast rise time without adding ringing. Conversely, for continuous, smooth signals (like sine waves), it actively suppresses those imaging components. The goal is to dynamically adapt the filtering based on the waveform's shape.

3. "Time domain smearing" vs. Sinc Interpolation
I completely agree that for a strictly band-limited signal, a linear phase reconstruction filter (sinc interpolation) is mathematically the most accurate way to reconstruct the sampled data.
However, the philosophical question here is: "What is the original sound?" Acoustic instruments and pre-mastered, un-band-limited high-res analog waveforms do not contain pre-ringing or post-ringing. The ambition of this upsampler isn't just to mathematically reconstruct the band-limited CD signal, but rather to estimate and hallucinate the pre-ADC continuous waveform that existed before the brick-wall anti-aliasing filter was applied.

4. Transient response, Ultrasonic frequencies, and IMD
You are absolutely correct that the tiny hairs in our inner ear cannot physically respond to frequencies like 22 kHz or higher. However, dealing with these frequencies is crucial for two physical/psychoacoustic reasons:

IMD (Intermodulation Distortion): If continuous ultrasonic mirroring/garbage is passed through to the amplifier and speakers, non-linearities in the hardware will cause IMD to fold back down into the audible band. Thus, removing continuous mirroring is necessary for audible clarity.

Time-domain Envelope Perception: While we cannot hear ultrasonic pitch, high frequencies dictate the amplitude envelope and timing of transients. Human hearing is extremely sensitive to timing cues (microsecond level ITD). Furthermore, it is well documented in acoustic research that speaker diaphragms, the air itself, and the human cochlea exhibit non-linear behavior. Because of these non-linearities, the temporal envelope generated by interacting high-frequency components can be demodulated or folded down into the audible range, making it perceptible to humans. Pre-ringing introduces unnatural anticipatory energy, and post-ringing can act as a subtle smearing of the envelope.

---

The NN approach is an experimental attempt to bypass the rigid LTI trade-offs. This upsampling doesn't aim to reproduce inaudible high frequencies as pitches. Instead, it purposefully 'hallucinates' high-frequency information to smooth out the envelope, bringing the waveform closer to its natural acoustic state. It is definitely not perfect yet, but I appreciate your scientific perspective to help push the iteration forward!
 
1. Regarding the 20dB magnitude drop at 20 kHz
This is partially a measurement (FFT algorithm) artifact. But because the base process utilizes a slow roll-off filter to eliminate ringing natively, some high-frequency attenuation (around -6 dB at 20 kHz, depending on the base FIR) is mathematically inevitable. I acknowledge this is a trade-off and an area for further refinement. I am looking into applying an ultra-short tap FIR convolution to gently EQ and compensate for the attenuated response in that region.
You can use
Code:
freqz( ... )
to plot the frequency response without those artifacts in MATLAB/Octave, which I assume you used for those plots. I think this drop in FR should be a top priority to fix, if you want to compare your filter to any other regular filter. Should also be pretty easy to do.
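For readers without MATLAB/Octave, the same idea can be sketched in plain Python: a linear filter's response is evaluated directly from its taps, so no FFT of a measured signal (and none of the windowing or leakage artifacts) is involved. The two-tap averaging filter is only a placeholder, not one of the filters in this thread, and the function names are illustrative. Note that this only applies to the linear FIR/IIR stages; a neural network stage has no single frequency response.

```python
import cmath
import math


def freq_response(taps, freq_hz, fs_hz):
    """H(e^{jw}) = sum_n h[n] * exp(-j*w*n), evaluated at a single frequency."""
    w = 2 * math.pi * freq_hz / fs_hz
    return sum(h * cmath.exp(-1j * w * n) for n, h in enumerate(taps))


def gain_db(taps, freq_hz, fs_hz):
    """Magnitude response in dB (negative infinity at an exact zero)."""
    mag = abs(freq_response(taps, freq_hz, fs_hz))
    return 20 * math.log10(mag) if mag > 0 else float("-inf")


FS = 88_200
taps = [0.5, 0.5]  # placeholder filter: a two-sample average

for f in (1_000, 10_000, 20_000, 22_050):
    print(f"{f:>6} Hz: {gain_db(taps, f, FS):6.2f} dB")
```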

2. Selective suppression and enhancement of images
This is actually the core feature of the AI approach, rather than a bug. Traditional LTI filters apply the same mathematical rules regardless of the signal. The NN, however, is content-aware.
For sharp transients (like square waves), it intentionally generates/allows specific high-frequency components (which look like imaging) to reproduce the fast rise time without adding ringing. Conversely, for continuous, smooth signals (like sine waves), it actively suppresses those imaging components. The goal is to dynamically adapt the filtering based on the waveform's shape.
I get the idea, but I must admit I'm sceptical. I've got some limited experience with neural network training and some more with signal processing. In essence, my gut tells me that "there is no free lunch": If you very selectively suppress specific frequency components, this is likely to have some undesired effects. The NN-filter FR also shows a very "grassy" presentation past ~24 kHz, which is likely due to the selective suppression that it is trained for, but the overall energy in the ultrasonic region is still very high. I would expect that this will show some very odd behaviour with otherwise benign signals.

Did you test the NN-filter with a broader selection of synthetic (sine wave) signals and possibly some actual music? I assume you didn't plot everything in your first post because it would have taken up too much space. But could you show some more varied examples of suppressed images in more complex signals?

3. "Time domain smearing" vs. Sinc Interpolation
I completely agree that for a strictly band-limited signal, a linear phase reconstruction filter (sinc interpolation) is mathematically the most accurate way to reconstruct the sampled data.
However, the philosophical question here is: "What is the original sound?" Acoustic instruments and pre-mastered, un-band-limited high-res analog waveforms do not contain pre-ringing or post-ringing. The ambition of this upsampler isn't just to mathematically reconstruct the band-limited CD signal, but rather to estimate and hallucinate the pre-ADC continuous waveform that existed before the brick-wall anti-aliasing filter was applied.
The idea that there is no ringing in the ground truth signal is not quite correct, as far as I understand it. Instruments create sound through vibration, which cannot start or stop immediately due to the inertia of the instrument itself. A vibration will always have a non-zero fade-in and fade-out time. Therefore, there will be components in the ground truth waveform which look like post-ringing, and likely also some that look like minimal pre-ringing. At least for real instruments, maybe not so much for synthesized stuff.

I get that those vibrations are technically not what we describe as ringing when talking about filters - they are merely ultrasonic components of small-ish magnitude which get lost during downsampling. But the idea that such components should not exist at all in a reconstructed signal is incorrect, as far as I understand it.

4. Transient response, Ultrasonic frequencies, and IMD
You are absolutely correct that the tiny hairs in our inner ear cannot physically respond to frequencies like 22 kHz or higher. However, dealing with these frequencies is crucial for two physical/psychoacoustic reasons:

IMD (Intermodulation Distortion): If continuous ultrasonic mirroring/garbage is passed through to the amplifier and speakers, non-linearities in the hardware will cause IMD to fold back down into the audible band. Thus, removing continuous mirroring is necessary for audible clarity.
IMD is an effect of some concern, I agree.

Time-domain Envelope Perception: While we cannot hear ultrasonic pitch, high frequencies dictate the amplitude envelope and timing of transients. Human hearing is extremely sensitive to timing cues (microsecond level ITD). Furthermore, it is well documented in acoustic research that speaker diaphragms, the air itself, and the human cochlea exhibit non-linear behavior. Because of these non-linearities, the temporal envelope generated by interacting high-frequency components can be demodulated or folded down into the audible range, making it perceptible to humans. Pre-ringing introduces unnatural anticipatory energy, and post-ringing can act as a subtle smearing of the envelope.
I am not so convinced about this point. First of all, let me say that the wording of this paragraph is quite obscure. For example, what is "temporal envelope"? I assume you used an LLM to at least reformulate your post? If so, please consider using your own words and just a translator if necessary. LLMs tend to produce long-winded, nice-sounding text even when the idea would fit into fewer words. This can make the resulting output difficult to follow.

Concerning the timing argument itself, our ears may be sensitive on the microsecond level - I am not sure down to which exact resolution. However, at 44.1 kHz, every timing info below the sampling time of 22.7 µs is essentially lost. So pretty much any reconstruction we do inside that time window would be considered valid - be it using an NN, a minimum phase or a linear phase filter. I don't see any reason why the NN approach would be "more true" than the others here, but it is also not inherently "less true".

The argument about "anticipatory energy" and "smearing of the envelope" sounds quite obscure to me and I am not sure how to interpret that correctly.

The NN approach is an experimental attempt to bypass the rigid LTI trade-offs. This upsampling doesn't aim to reproduce inaudible high frequencies as pitches. Instead, it purposefully 'hallucinates' high-frequency information to smooth out the envelope, bringing the waveform closer to its natural acoustic state. It is definitely not perfect yet, but I appreciate your scientific perspective to help push the iteration forward!
Again, I think it is a very interesting experiment! :)
 
An interesting experiment. But as (more tactfully) pointed out by @RandomEar it is a solution looking for a problem.
 
However, at 44.1 kHz, every timing info below the sampling time of 22.7 µs is essentially lost.
Actually this is also a myth about digital audio - as explained here, by our own @mansr


and as demonstrated in the Monty video on an analogue oscilloscope at about the 20:40 point.
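The point about sub-sample timing can be checked with a small experiment. This is a sketch under simplifying assumptions (a single band-limited 1 kHz tone, 16-bit rounding without dither, and a known tone frequency); the names are illustrative. A delay of 1 µs, far shorter than the 22.68 µs sample period, is recovered from the samples by reading the phase of the tone.

```python
import cmath
import math

FS = 44_100            # sample rate (Hz)
TONE_HZ = 1_000
N = 441                # exactly 10 cycles of the tone
AMPLITUDE = 16_384     # roughly -6 dBFS on a 16-bit scale
TRUE_DELAY_S = 1e-6    # 1 microsecond, well below one sample period


def sampled_tone(delay_s):
    """16-bit integer samples of a sine delayed by delay_s seconds."""
    return [
        round(AMPLITUDE * math.sin(2 * math.pi * TONE_HZ * (n / FS - delay_s)))
        for n in range(N)
    ]


def estimate_delay(samples):
    """Estimate the delay from the phase of the DFT bin at TONE_HZ."""
    w = 2 * math.pi * TONE_HZ / FS
    x = sum(s * cmath.exp(-1j * w * n) for n, s in enumerate(samples))
    # For A*sin(w*n - phi) over whole cycles, the bin phase is -phi - pi/2.
    phase_lag = -cmath.phase(x) - math.pi / 2
    return phase_lag / (2 * math.pi * TONE_HZ)


estimate = estimate_delay(sampled_tone(TRUE_DELAY_S))
print(f"sample period: {1e6 / FS:.2f} us, true delay: 1.000 us, "
      f"estimate: {estimate * 1e6:.3f} us")
```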
 
Yes and no. Any mathematically sound reconstruction has a much higher time resolution than 22.7 µs (for 44.1 kHz), but it is also just one valid interpretation of many. For a theoretical case with a perfectly bandlimited input signal and without quantization errors, I think there should be no loss of precision at all. With quantization, the calculation you linked is the lower limit. But in practice, it's not that good (input not perfectly bandlimited, etc.) and each reconstruction filter will have various deviations from the unknown ground truth - which are all still valid renditions of it.

I don't think this is a problem at all, but there are measurable differences between filters and some information must clearly be lost.
 
I get the idea, but I must admit I'm sceptical. I've got some limited experience with neural network training and some more with signal processing. In essence, my gut tells me that "there is no free lunch": If you very selectively suppress specific frequency components, this is likely to have some undesired effects. The NN-filter FR also shows a very "grassy" presentation past ~24 kHz, which is likely due to the selective suppression that it is trained for, but the overall energy in the ultrasonic region is still very high. I would expect that this will show some very odd behaviour with otherwise benign signals.

Did you test the NN-filter with a broader selection of synthetic (sine wave) signals and possibly some actual music? I assume you didn't plot everything in your first post because it would have taken up too much space. But could you show some more varied examples of suppressed images in more complex signals?

  • Track Used: "Bubbles" by Yosi Horikawa (from the 13s to 28s mark).
  • Source Limitations: The original source file is an MP3, which means there is a native cutoff already present at around 18 kHz.
  • Filter Comparison:
    • Normal Slow Filter (Bessel): Mirroring (imaging) components in the ultrasonic region are highly noticeable.
    • My NN-Filter (Distill): The mirroring is noticeably suppressed compared to the slow filter, though some high-frequency content still remains.
    • Standard FIR (10k-tap) Filter: The frequencies above the cutoff are completely eliminated.

bubbles_88k2_13s_15s_bessel_distill_fir_no_interval_spectrogram.png

If you would like to actually listen to the sound, I recommend that you try out the training data available on Github.
 
The idea that there is no ringing in the ground truth signal is not quite correct, as far as I understand it. Instruments create sound through vibration, which cannot start or stop immediately due to the inertia of the instrument itself. A vibration will always have a non-zero fade-in and fade-out time. Therefore, there will be components in the ground truth wave form which look like post ringing and likely also some minimal pre-ringing-like. At least for real instruments, maybe not so much for synthesized stuff.

I get that those vibrations are technically not what we describe as ringing when talking about filters - they are merely ultrasonic components of small-ish magnitude which get lost during downsampling. But the idea that such components should not exist at all in a reconstructed signal is incorrect, as far as I understand it.
Aren't you confusing the reverberation or ringing in the audio source with the ringing that appears when you try to upsample with a steep filter using a long tap?
The problem isn't with what's in the audio source, but with the ringing that appears when upsampling.


Concerning the timing argument itself, our ears may be sensitive on the microsecond level - I am not sure down to which exact resolution. However, at 44.1 kHz, every timing info below the sampling time of 22.7 µs is essentially lost. So pretty much any reconstruction we do inside that time window would be considered valid - be it using an NN, a minimum phase or a linear phase filter. I don't see any reason why the NN approach would be "more true" than the others here, but it is also not inherently "less true".

The argument about "anticipatory energy" and "smearing of the envelope" sounds quite obscure to me and I am not sure how to interpret that correctly.
I've posted an image of a long-tap minimum-phase FIR square wave, so I think you'll understand that post-ringing appears on the sub-millisecond order (a few hundred microseconds).
It's true that we can barely perceive pitch above 20 kHz. However, it is said that we can perceive the resulting sub-millisecond envelope blurring.
I believe there is some research suggesting that this can become perceptible somewhere along the chain of conversions between the digital signal, electrical signals, transducers, air, and neural signals in the brain.

In any case, even if this NN upsampler produced the ultimate ideal result - a recreation of the original high-resolution sound source - it would be difficult to clearly tell which is which.
However, with a sufficient system and training, you should be able to tell the difference - listen for long periods of time to see which one sounds less tiring, which one has a sharper metallic sound, etc.*

* reference:
 
but it is also just one valid interpretation of many
No, it isn't. There is only one valid interpretation of the band-limited signal. That is the whole point of the sampling theorem. I grant you that there is no perfect reconstruction filter, so the perfect representation of the band-limited signal is never quite achieved, but the errors are below audibility - even at Redbook. And your statement that "timing below sample time is lost" is wrong, by quite a margin.
 
Aren't you confusing the reverberation or ringing in the audio source with the ringing that appears when you try to upsample with a steep filter using a long tap?
The problem isn't with what's in the audio source, but with the ringing that appears when upsampling.
No, my main point is that the ringing introduced by the filter is a valid reconstruction. The best reconstruction of the downsampled signal is rarely a straight line connecting the sample points. Square waves are not bandlimited and therefore an exception. Here a theoretical example from my post on filters, using only sine waves:
index.php

There are clearly peaks and valleys in between samples of the downsampled signal. A good filter will reconstruct those correctly. If the highest frequency in this example was closer to 22 kHz, there would be no way to tell apart "valid reconstruction" and "bad ringing".
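The inter-sample peaks mentioned above can be illustrated with a short sketch. It is not the example from the linked post; it assumes a sine at a quarter of the sample rate (11.025 kHz at 44.1 kHz) with a 45 degree phase offset and a truncated Whittaker-Shannon (sinc) interpolation, with illustrative names. Every sample has magnitude of about 0.707, yet the band-limited reconstruction reaches about 1.0 halfway between samples.

```python
import math


def sinc(x):
    """Normalised sinc: sin(pi*x) / (pi*x)."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def sample(n):
    """Sine at fs/4 with a 45 degree phase offset; time is measured in samples."""
    return math.sin(math.pi * n / 2 + math.pi / 4)


def interpolate(t, half_width=4000):
    """Truncated Whittaker-Shannon interpolation at fractional sample time t."""
    return sum(sample(n) * sinc(t - n) for n in range(-half_width, half_width + 1))


largest_sample = max(abs(sample(n)) for n in range(8))
between = interpolate(0.5)   # halfway between sample 0 and sample 1

print(f"largest sample magnitude: {largest_sample:.3f}")
print(f"reconstructed value at t = 0.5: {between:.3f}")
```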

You assume that the filter ringing is a defect and that the ground truth signal (before downsampling) did not contain those frequencies close to Nyquist. That assumption isn't always correct. For the specific test case of square waves that assumption might deliver better results. But assuming that the reconstruction of a linear phase fast filter - including the ringing - is a good representation of the ground truth is equally valid.

I've posted an image of a long-tap minimum-phase FIR square wave, so I think you'll understand that post-ringing appears on the sub-millisecond order (a few hundred microseconds).
It's true that we can barely perceive pitch above 20 kHz. However, it is said that we can perceive the resulting sub-millisecond envelope blurring.
For minimum phase filters, the ringing frequency is farther away from Nyquist as far as I have seen. In that case, post-ringing at 17 or 15 kHz could potentially be audible, yes. That's why I would recommend using a fast linear phase filter. But I don't understand what "envelope" is referring to in this argument.

I believe there is some research suggesting that this can become perceptible somewhere along the chain of conversions between the digital signal, electrical signals, transducers, air, and neural signals in the brain.
Not sure what this is aiming at.

In any case, even if this NN upsampler produced the ultimate ideal result - a recreation of the original high-resolution sound source - it would be difficult to clearly tell which is which.
However, with a sufficient system and training, you should be able to tell the difference - listen for long periods of time to see which one sounds less tiring, which one has a sharper metallic sound, etc.*

* reference:
I'm looking forward to the test results.
 
1771995691868.png

"envelope" is a curve that smoothly connects the peaks of an oscillating waveform.

In that case, post-ringing at 17 or 15 kHz could potentially be audible, yes
Ringing occurs at the cutoff frequency. In this case the cutoff is at 20 kHz, so the ringing is at 20 kHz. What do 15 kHz and 17 kHz refer to?
No, my main point is that the ringing introduced by the filter is a valid reconstruction. The best reconstruction of the downsampled signal is rarely a straight line connecting the sample points. Square waves are not bandlimited and therefore an exception. Here a theoretical example from my post on filters, using only sine waves:
Let's have a serious discussion. I'm sure you've seen impulse responses and square waves. They ring a lot, right?
Do impulse response-like waves not exist in reality? No, they do. For example, very high-pitched metallic sounds fall into this category.
https://audiostock.jp/audio/131540

It has a very steep rise at first, and if this steep rise is upsampled with only waves below 20 kHz, ringing will occur at 20 kHz, the highest theoretical frequency.
As a result, the linear phase of the ultra-long tap FIR, which is said to be ideal, is disrupted by the ringing, which causes the waveform's "envelope" to be distorted.
If we can upsample using higher-order components of the Fourier series, that is, if we can use frequencies above 20 kHz, this ringing will not occur.
 
If you upsample using a mathematically deterministic, linear method, creating a brickwall will result in ringing. On the other hand, rejecting ringing and widening the stopband will result in image components.
This was clearly a mathematical trade-off.
That's why I used neural networks, reinforcement learning, and statistical stochastic inference.
For the smooth waveform presented by RandomEar, I wanted to eliminate mirroring by smoothly connecting it like an FIR filter with a steep cutoff.
On the other hand, for the sharp metallic waveform I presented, I wanted to minimize ringing and waveform distortion by using high-frequency components like a slow filter.
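
The imaging half of this trade-off can be sketched in a few lines of NumPy. This is only an illustration under assumed parameters: a 16 kHz test tone, plain sample repetition as a stand-in for NOS behaviour, and a 511-tap Kaiser-windowed sinc (β = 10) as the "steep" filter. It is not the NN upsampler or any of the filters measured in the first post.

```python
import numpy as np

fs_in, fs_out = 44100, 88200
n_in = 4410                                   # 100 ms of input
t = np.arange(n_in) / fs_in
x = np.sin(2 * np.pi * 16000 * t)             # assumed 16 kHz test tone

# (a) NOS-like 2x "upsampling": repeat every sample (zero-order hold).
held = np.repeat(x, 2)

# (b) Zero-stuff, then low-pass at 22.05 kHz with a Kaiser-windowed sinc.
stuffed = np.zeros(2 * n_in)
stuffed[::2] = x
taps = 511                                    # assumed filter length
k = np.arange(taps) - (taps - 1) / 2
h = 0.5 * np.sinc(0.5 * k) * np.kaiser(taps, 10.0)
h *= 2.0 / h.sum()                            # DC gain 2 compensates for the stuffed zeros
filtered = np.convolve(stuffed, h, mode="same")

def level_db(y, freq, fs):
    """Level of the FFT bin at freq, in dB relative to a unit-amplitude sine."""
    spec = np.abs(np.fft.rfft(y)) / (len(y) / 2)
    return 20 * np.log10(spec[int(round(freq * len(y) / fs))] + 1e-300)

mid = slice(2205, 6615)                       # skip the filter's edge transients
image_hz = fs_in - 16000                      # mirror image at 28.1 kHz
image_hold_db = level_db(held[mid], image_hz, fs_out) - level_db(held[mid], 16000, fs_out)
image_fir_db = level_db(filtered[mid], image_hz, fs_out) - level_db(filtered[mid], 16000, fs_out)
print(f"image vs. tone, sample repetition: {image_hold_db:.1f} dB")
print(f"image vs. tone, steep FIR:         {image_fir_db:.1f} dB")
```

Sample repetition leaves the 28.1 kHz image only a few dB below the tone, while the steep filter pushes it far down. The price of the steep filter, ringing on non-band-limited transients, is not shown in this sketch.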

Please take a look at my first post. Sweep = An ideal sine curve has no mirror components, right?
On the other hand, when it comes to impulse responses or square waves, the key to this neural network is to minimize ringing by using high-frequency components above 20 kHz.
 
Let's have a serious discussion. I'm sure you've seen impulse responses and square waves. They ring a lot, right?
The ringing is not a result of filtering per se. Ringing is a natural "feature" of band-limited signals. And there are no pure impulses, or square waves in nature because *all* transmission mediums are band-limited somewhere.

You can construct a band-limited square wave without any filtering by summing the Fourier series terms for the odd harmonics, each at 1/(harmonic number) amplitude, and stopping at the highest harmonic that fits within the band limit.
Screenshot 2026-02-25 at 09.16.11.png

The result will have exactly the same ringing as results from a "perfect" square wave filtered to the same bandwidth with a linear-phase filter, and the frequency of the ringing will be that of the highest harmonic that fits in band. Yet it occurs without any filtering taking place.

In other words, ringing is a natural characteristic of band limitation - and is present as a natural part of the music as performed. It is not that the filters cause it, it is just that filters are the most convenient way of creating a band-limited signal.
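
The construction described above can be sketched in NumPy. The 1 kHz fundamental, the 20 kHz band limit and the dense time grid are assumptions chosen for illustration; no filter is applied anywhere in the code.

```python
import numpy as np

fs = 44100 * 16                 # dense time grid, only so the waveform is easy to inspect
f0 = 1000.0                     # assumed fundamental of the square wave
band_limit = 20000.0            # assumed band limit
t = np.arange(int(fs * 0.002)) / fs          # two periods

# Odd harmonics only (1, 3, 5, ...), each at 1/(harmonic number) amplitude,
# stopping at the highest harmonic that still fits below the band limit.
harmonics = np.arange(1, int(band_limit // f0) + 1, 2)
square = (4 / np.pi) * sum(np.sin(2 * np.pi * h * f0 * t) / h for h in harmonics)

overshoot = square.max() - 1.0
print(f"highest harmonic used: {harmonics[-1] * f0:.0f} Hz")
print(f"overshoot above the flat top: {overshoot:.3f}")
```

The overshoot and ripple next to each edge (the Gibbs phenomenon) come purely from truncating the series. Adding more harmonics squeezes the ripple closer to the edge but does not remove the overshoot.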

Furthermore (and this is critical), your ears already act as a biomechanical low-pass filter, operating with a cutoff between 15 kHz and 20 kHz (even for young ears). Even if higher frequencies are present in the sound waves reaching your ear, they are filtered out by the mechanisms of the ear before being sent to your brain. The signal to the brain is band-limited with all the ringing that you would expect from any band-limited signal.



"envelope" is a curve that smoothly connects the peaks of an oscillating waveform.

And the envelope (as you have shown it there) exists in the spectrum of the sound as its own frequency tone. If you FFT that waveform you have given as an example, it will show both the carrier (high frequency) and the baseband signal (the low) in the spectrum. It is the low frequency our ears detect, just as they do everything else. They are not detecting the ultrasonic tone, and joining the peaks - because as pointed out by @RandomEar they are (by definition) not able to detect it.

No, they do. For example, very high-pitched metallic sounds fall into this category.

You mean for example, the strike phase "TS" of a cymbal (one of the "fastest" sounding sounds in percussion). Have you ever looked at the waveform of that sound? Here is an example:



The envelope of the strike from the start of the sound until peak amplitude takes about 8 ms (equivalent to about 125Hz). The highest frequency of that strike phase is less than 9 kHz. Hardly an infinite bandwidth impulse, or square wave.
 
View attachment 513618
"envelope" is a curve that smoothly connects the peaks of an oscillating waveform.
Thanks, I understand what you mean now. Is there evidence that this is relevant to our perception of the sound?

Ringing occurs at the cutoff frequency. In this case, I have the cutoff at 20kHz, so 20kHz ringing occurs. What do 15k and 17kHz mean?
It occurs near the cut-off frequency, but not necessarily at it. In my recent analysis, minimum phase filters displayed ringing at lower frequencies like 17 or 15 kHz, whereas linear phase filters rang at the Nyquist frequency. The test signal was music and not an impulse, which may be important.

Let's have a serious discussion. I'm sure you've seen impulse responses and square waves. They ring a lot, right?
Do impulse response-like waves not exist in reality? No, they do. For example, very high-pitched metallic sounds fall into this category.
True, it is close to an impulse.

It has a very steep rise at first, and if this steep rise is upsampled with only waves below 20 kHz, ringing will occur at 20 kHz, the highest theoretical frequency.
As a result, the linear phase of the ultra-long tap FIR, which is said to be ideal, is disrupted by the ringing, which causes the waveform's "envelope" to be distorted.
If we can upsample using higher-order components of the Fourier series, that is, if we can use frequencies above 20 kHz, this ringing will not occur.
Yes, your results show that you can suppress the ringing quite a bit using that method. I am just not sure the ringing is a problem and that the increased envelope is problematic, outside of looking ugly. I have seen claims on various forums and websites where people were under the impression that they could hear pre-ringing in linear phase filters, but I am not aware of well designed studies which support those claims.

You also need ultrasonic components to achieve the ringing suppression, which are generally considered undesirable in audio reproduction. For example, it is possible that such ultrasonic components would lead to increased IMD in the amplifier or tweeter. The NN-optimized filter is an interesting concept and like in most engineering problems, there are upsides and downsides to it.
 
Why is the stopband rejection of the 10k-tap FIR filter so poor? For comparison, here are some plots of the default "ridiculous overkill" filter I currently use in my resampler (shown for 96kHz to 44.1kHz conversion)—
Magnitude, DC-26kHz (vertical dashed line is 22.05kHz):
filter_mag.png
Flatness, 20kHz-22kHz (-0.01dB at ~21kHz):
filter_flatness.png
Passband ripple, 11.025kHz to 22.05kHz (note the scale):
filter_ripple.png

This filter is "only" 1270 taps, yet achieves >230dB stopband rejection with a very sharp transition band.
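
As a rough cross-check on numbers like these, Kaiser's empirical design formulas relate the window parameter β, the stopband attenuation and the tap count. The sketch below is an assumption-laden estimate, not the design procedure of either filter discussed here: the 2 kHz transition width and 96 kHz design rate are hypothetical, and the formulas describe a plain linear-phase Kaiser-windowed sinc, not a minimum-phase conversion or a polyphase resampler.

```python
import math

def kaiser_beta(atten_db):
    """Kaiser's empirical beta for a target stopband attenuation in dB."""
    if atten_db > 50:
        return 0.1102 * (atten_db - 8.7)
    if atten_db >= 21:
        return 0.5842 * (atten_db - 21) ** 0.4 + 0.07886 * (atten_db - 21)
    return 0.0

def kaiser_atten(beta):
    """Inverse of the beta formula, valid for the attenuation > 50 dB branch."""
    return beta / 0.1102 + 8.7

def kaiser_taps(atten_db, transition_hz, fs):
    """Estimated tap count for a given attenuation and transition width."""
    delta_omega = 2 * math.pi * transition_hz / fs    # transition width in rad/sample
    return math.ceil((atten_db - 7.95) / (2.285 * delta_omega)) + 1

atten_beta10 = kaiser_atten(10.0)             # beta = 10 is the value quoted in the first post
beta_230 = kaiser_beta(230.0)
taps_230 = kaiser_taps(230.0, 2000.0, 96000)  # hypothetical transition width and rate
print(f"beta = 10 corresponds to roughly {atten_beta10:.0f} dB of stopband attenuation")
print(f"230 dB needs beta of about {beta_230:.1f} and about {taps_230} taps")
```

If these formulas apply, a window with β = 10 tops out near 100 dB of rejection regardless of tap count, since extra taps only narrow the transition band. That would be one possible explanation for the 10k-tap result, but it is a guess until the actual design is known.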
 
Such a high-quality upsampler is useful for digital recordings from the last century, when analog low-pass anti-aliasing filters were used before the ADC. Nowadays all recordings pass through digital filters that stain the sound with this pre- and post-resonance at 22.05 kHz. The resonance itself is considered inaudible unless you are Batman or Catwoman; the problem is the beat it forms with the entering and exiting frequencies. Musicians notice that the recorded instrument sounds different from what they played. Drummers in particular hear a tick before the stick hits the edge of the drum, one they did not produce while playing.
Once the recording is stained, it no longer matters whether it then passes through ultra-transparent reproduction.
 
Such a high-quality upsampler is useful for digital recordings from the last century, when analog low-pass anti-aliasing filters were used before the ADC. Nowadays all recordings pass through digital filters that stain the sound with this pre- and post-resonance at 22.05 kHz.
What resonance?

The resonance itself is considered inaudible unless you are Batman or Catwoman; the problem is the beat it forms with the entering and exiting frequencies. Musicians notice that the recorded instrument sounds different from what they played. Drummers in particular hear a tick before the stick hits the edge of the drum, one they did not produce while playing.
Any proof for this?

Once the recording is stained, it no longer matters whether it then passes through ultra-transparent reproduction.
This just sounds like fairy tales.

Short excursion: Analog filters have advantages, namely lower latency and a reduction in the required compute power. The latter was relevant in the early days of digital recording, which is probably why such filters were more popular then. However, realistic analog filters are significantly less steep in the frequency domain (meaning either an earlier drop-off in the passband or more ultrasonic noise and aliasing), are not linear phase, have higher tolerances than digital ones, can drift over time and with temperature, and have other disadvantages [1, 2]. Their step response can also show ringing [1], which I assume is what you dislike about commonly used sharp digital filters.

You can easily design digital filters without ringing by accepting god-awful frequency response curves. The same can be done in the analog domain. This audiophile idea that "analog=better" is mostly a misunderstanding / lack of knowledge about how stuff is engineered.

Since this thread isn't about analog filters or old music, I kindly suggest starting a new thread if you want to discuss this in more detail.
 
In fact it is neither resonance nor ringing. When an impulse enters the filter, it is multiplied by coefficients of alternating sign (those of a sinc function, for example), which is why it is always at half the sample rate, 22.05 kHz for CD. In this way the filter betrays its programmed coefficients and the number of samples used before and after the interpolated value. If, instead of a pulse, a transient signal such as a piano or guitar note enters, it is multiplied by these coefficients, yielding sum and difference frequencies. If the transient has a 16 kHz component, it will generate a 6 kHz beat, which remains in the recording file and definitively stains it.
From recording to playback, the signal passes through several digital filters, so one more with oversampling doesn't make any difference.
I tested the performance of the SRC4190 a few years ago with 16-bit 48 kHz in and 16-bit 48 kHz out, with 4x oversampling at 32 bits in between. Four young ears couldn't hear any difference between engaged and bypassed.
 
In fact it is neither resonance nor ringing. When an impulse enters the filter, it is multiplied by coefficients of alternating sign (those of a sinc function, for example), which is why it is always at half the sample rate, 22.05 kHz for CD. In this way the filter betrays its programmed coefficients and the number of samples used before and after the interpolated value.
The filter acts exactly as designed. I don't see the problem here. Focusing on details like the fact that positive and negative coefficients exist detracts from what matters in the end: phase and frequency response.

If, instead of a pulse, a transient signal such as a piano or guitar note enters, it is multiplied by these coefficients, yielding sum and difference frequencies. If the transient has a 16 kHz component, it will generate a 6 kHz beat, which remains in the recording file and definitively stains it.
Since 16 kHz is way below the Nyquist frequency for typical audio sampling rates (≥44.1 kHz), there will be no problem perfectly replicating that (see here). True impulses are not bandwidth limited though, which means they can't be replicated faithfully in a bandlimited 44.1/48/whatever kHz signal - no matter what filter you choose.
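
A quick way to check the "6 kHz beat" idea numerically: pass a steady 16 kHz tone through a linear-phase low-pass FIR and look at the spectrum around 22.05 − 16 = 6.05 kHz. The filter below (255 taps, Kaiser β = 10, 20 kHz cutoff) is an assumed stand-in, not the filter inside any particular ADC or sample-rate converter.

```python
import numpy as np

fs = 44100
n = 8820                                      # 200 ms of signal
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 16000 * t)             # steady 16 kHz tone

# Assumed linear-phase low-pass: Kaiser-windowed sinc with a 20 kHz cutoff.
taps = 255
k = np.arange(taps) - (taps - 1) / 2
fc = 2 * 20000 / fs                           # cutoff normalised to Nyquist
h = fc * np.sinc(fc * k) * np.kaiser(taps, 10.0)
h /= h.sum()                                  # unity gain at DC
y = np.convolve(x, h, mode="same")

mid = y[2205:6615]                            # 100 ms from the middle, clear of edge transients
spec = np.abs(np.fft.rfft(mid))
tone_bin = 16000 * len(mid) // fs             # 10 Hz per bin, so bin 1600
beat_bin = 6050 * len(mid) // fs              # bin 605, the claimed difference frequency
tone_gain = spec[tone_bin] / (len(mid) / 2)
beat_db = 20 * np.log10((spec[beat_bin] + 1e-300) / spec[tone_bin])
print(f"gain at 16 kHz: {tone_gain:.6f}")
print(f"6.05 kHz relative to the tone: {beat_db:.0f} dB")
```

Filtering is a convolution, which scales and delays each frequency already present but cannot create sum or difference frequencies. Those require multiplying two signals, that is, a nonlinear or time-varying stage. In this sketch the 6.05 kHz bin sits at the numerical noise floor.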

By selecting a filter, you choose which "defect" is more acceptable to you: Filter ringing/Gibbs phenomenon, or a slower response/suppression of high frequency content. Stopband attenuation or passband attenuation. Ripple or smoothness. But again: This is independent of whether the filter is analog or digital.

From recording to playback, the signal passes through several digital filters, so one more with oversampling doesn't make any difference.
I tested the performance of the SRC4190 a few years ago with 16-bit 48 kHz in and 16-bit 48 kHz out, with 4x oversampling at 32 bits in between. Four young ears couldn't hear any difference between engaged and bypassed.
So... it's a problem but it is inaudible?
 