
Understanding Upsampling/Interpolation

TabCam

Active Member
Joined
Feb 16, 2020
Messages
199
Likes
170
Although the math and perceptual senses are different, the underlying goal is the same: how can we regenerate the original signal as accurately as possible? We know we have enough information if we sample at double the highest frequency; if we halve the sample rate, we cannot restore the signal. So the question is not whether we can regenerate the signal from a 10 kHz file, but whether we can use deep NN techniques to reconstruct the signal more accurately than by other means.

BTW, it was never about them being the same; it's just that the argument that image processing provided an example of upsampling not helping is no longer valid, as the latest examples show it does help for image processing.
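To make the Nyquist point concrete, here is a minimal sketch (illustration values only: the 16 kHz sample rate and 10 kHz tone are made up for the example) showing how a tone above half the sample rate aliases and becomes unrecoverable:

```python
import numpy as np

fs = 16_000                          # sample rate (Hz), chosen for the example
t = np.arange(fs) / fs               # one second of sample instants
x = np.sin(2 * np.pi * 10_000 * t)   # 10 kHz tone, above fs/2 = 8 kHz

# The dominant frequency in the sampled data is no longer 10 kHz:
bins = np.abs(np.fft.rfft(x))
print(np.argmax(bins) * fs / len(x))  # -> 6000.0, i.e. fs - 10 kHz
```

Once the tone has folded down to 6 kHz, no upsampler can tell it apart from a genuine 6 kHz tone; the information is gone.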
 
OP
amirm

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,593
Likes
239,564
Location
Seattle Area
Although the math and perceptual senses are different, the underlying goal is the same: how can we regenerate the original signal as accurately as possible? We know we have enough information if we sample at double the highest frequency; if we halve the sample rate, we cannot restore the signal. So the question is not whether we can regenerate the signal from a 10 kHz file, but whether we can use deep NN techniques to reconstruct the signal more accurately than by other means.
Straight resampling does nothing to restore original data, as discussed in the OP. It simply makes the data suitable for the new sample rate without adding artifacts.

There can be non-linear algorithms that attempt to predict the lost data. For audio, a technique called SBR, or Spectral Band Replication, takes the high frequencies that are present and assumes the original data contained similar content above them, at lower amplitude. This is used in low-bit-rate music codecs that run at lower than 44.1 kHz sampling rates to reproduce the "original CD" sound. We developed this at Microsoft, and while it worked surprisingly well on some content, it also generated annoying high-frequency artifacts at times.
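To show the band-replication idea in code, here is a deliberately naive sketch (my illustration; real SBR transmits envelope side information and is far more sophisticated than this) that fills an empty upper band by copying the surviving band upward at reduced amplitude:

```python
import numpy as np

def naive_band_replicate(x, gain=0.25):
    """Guess the missing top half of the spectrum by copying the
    occupied bottom half upward, scaled down. Assumes x was low-pass
    filtered so its upper spectral half is (nearly) empty."""
    X = np.fft.rfft(x)
    half = len(X) // 2
    X[half:2 * half] = gain * X[:half]   # the 'hypothesis' about the lost data
    return np.fft.irfft(X, len(x))
```

As JJ notes below, the hypothesis is sometimes simply wrong, and then the replicated band is audibly unrelated to what was actually discarded.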

For image processing, there are tons more algorithms, as JJ alluded to. You may have heard of "super resolution", for example, where multiple images, say in a video, are combined to create one higher-resolution image. Just like the audio example above, though, artifacts are easy to find. I remember Toshiba advertising such a technique in their TVs a while back, and all I could see was over-sharpening of the image and no real increase in resolution. Researchers tend to pick content that shows off these techniques working a lot better than they do across a wide range of content!

If you know the type of content you have, more specialized algorithms can be used for better results. But there will always be limitations.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,786
Location
My kitchen or my listening room.
Although the math and perceptual senses are different, the underlying goal is the same: how can we regenerate the original signal as accurately as possible?

There are a couple of things here. First, what do you use as an accuracy measure? For images for viewing, the process is enormously nonlinear and has to be (that is, for viewing, not for analysis). For sound, nonlinearity must be avoided at all costs.

But then you say "the original signal". OK, if you have a fixed photo that's been downsampled, maybe you have an original. If you have a statue and you can replicate the lighting, that's "an original".

But what's the original audio signal? Seriously. What is the "original"?
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,786
Location
My kitchen or my listening room.
This is used in low-bit-rate music codecs that run at lower than 44.1 kHz sampling rates to reproduce the "original CD" sound. We developed this at Microsoft, and while it worked surprisingly well on some content, it also generated annoying high-frequency artifacts at times.
The problem, of course, with that kind of processing is that you have to assume what the content you threw out sounded like. You can't actually describe it accurately, or you'd have to increase the bit rate (Shannon, as always, wins), so you have to develop a hypothesis. Then you put in "We Shall Be Happy" from Jazz by Ry Cooder, and the hypothesis is WRONG! (I'm thinking of the way Marisa Tomei says "wrong" in "My Cousin Vinny".)

If you know the type of content you have, more specialized algorithms can be used for better results. But there will always be limitations.

In other words, Shannon rules. Always. Information theory is real. :D
 

Sashoir

Active Member
Forum Donor
Joined
May 15, 2020
Messages
118
Likes
140
My apologies if this is not a pertinent question for this thread (it *seems* related to me, but I'm still very ignorant). Are these information-theoretic properties of sampling frequency and audible frequency the reason why there are comparatively many wireless subwoofers and adaptors vs. "normal" speakers? Because, e.g., you can transmit A3 and below at 24-bit resolution using only ~10.6 kbit/s of bandwidth if you low-pass filter and downsample the line signal, vs. ~1.5 Mbit/s for the full audible spectrum?
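For what it's worth, the arithmetic checks out, assuming A3 = 220 Hz, a mono sub feed, and critical sampling at exactly twice the highest frequency:

```python
sub = 2 * 220 * 24        # 10,560 bit/s for a 24-bit, 220 Hz-bandwidth feed
full = 48_000 * 16 * 2    # 1,536,000 bit/s for 16-bit/48 kHz stereo
print(sub, full)          # ~10.6 kbit/s vs ~1.5 Mbit/s
```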
 
Last edited:

TabCam

Active Member
Joined
Feb 16, 2020
Messages
199
Likes
170
There are a couple of things here. First, what do you use as an accuracy measure? For images for viewing, the process is enormously nonlinear and has to be (that is, for viewing, not for analysis). For sound, nonlinearity must be avoided at all costs.

But then you say "the original signal". OK, if you have a fixed photo that's been downsampled, maybe you have an original. If you have a statue and you can replicate the lighting, that's "an original".

But what's the original audio signal? Seriously. What is the "original"?
Easiest would be to take a lot of high-resolution recordings, sample those down to 44.1/16, and see which upsampling algorithm comes closest. It will still be difficult, but it is at least repeatable/verifiable. An electrical approach could also be taken, but then there is the extra DAC/ADC step, which adds its own additional distortion. As that last remark suggests, the DAC conversion is why we want as good quality as we can get, whether via upsampling or the original sample rate.

One last remark: I am not talking about dynamic range decompression; that is not the goal, although I hate the loudness war.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,786
Location
My kitchen or my listening room.
My apologies if this is not a pertinent question for this thread (it *seems* related to me, but I'm still very ignorant). Are these information-theoretic properties of sampling frequency and audible frequency the reason why there are comparatively many wireless subwoofers and adaptors vs. "normal" speakers? Because, e.g., you can transmit A3 and below at 24-bit resolution using only ~10.6 kbit/s of bandwidth if you low-pass filter and downsample the line signal, vs. ~1.5 Mbit/s for the full audible spectrum?

I'm not offhand sure what protocols are used for a Bluetooth subwoofer, but indeed, there is a lot less total information content in a signal that only goes up to, say, 300 Hz. Not only is the required bandwidth much smaller, the required dynamic range is also much smaller, due to the lack of hearing sensitivity at low levels and low frequencies.

But, frankly, I suspect they use some simple codec and run at either 44 or 48, simply to have cheap DAC hardware. That's pure supposition, but that's the cheap way to do it.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,786
Location
My kitchen or my listening room.
Easiest would be to take a lot of high-resolution recordings, sample those down to 44.1/16, and see which upsampling algorithm comes closest. It will still be difficult, but it is at least repeatable/verifiable.

So the question you're really asking (it's a valid question, but not likely to enlighten much) is "does the signal above 20 kHz matter to adult humans, adolescent humans, and youngsters?"

It is easy to show (see this presentation https://www.aes-media.org/sections/pnw/pnwrecaps/2016/jjsrc_jan2016/ ) that the in-band results can be made arbitrarily close to the original in-band part of the signal.
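A quick way to check that claim at home (a sketch using scipy's polyphase resampler, not the converter from the talk; the rates, tones, and filter choice are arbitrary example values):

```python
import numpy as np
from scipy.signal import resample_poly

fs = 96_000
t = np.arange(fs) / fs
# Tones comfortably inside the 24 kHz band of the target 48 kHz rate.
x = sum(np.sin(2 * np.pi * f * t) for f in (440.0, 3_000.0, 10_000.0))

kw = dict(window=('kaiser', 14.0))               # a low-ripple filter choice
back = resample_poly(resample_poly(x, 1, 2, **kw), 2, 1, **kw)

trim = 1_000                                     # skip filter edge effects
err = x[trim:-trim] - back[trim:-trim]
print(20 * np.log10(np.linalg.norm(err) / np.linalg.norm(x[trim:-trim])))
```

The printed in-band error sits far below anything audible, and it keeps shrinking as you spend more taps on the filter; that is the "arbitrarily close" part.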
 

TabCam

Active Member
Joined
Feb 16, 2020
Messages
199
Likes
170
So the question you're really asking (it's a valid question, but not likely to enlighten much) is "does the signal above 20 kHz matter to adult humans, adolescent humans, and youngsters?"

It is easy to show (see this presentation https://www.aes-media.org/sections/pnw/pnwrecaps/2016/jjsrc_jan2016/ ) that the in-band results can be made arbitrarily close to the original in-band part of the signal.
The question is more, does upsampling help to improve timing reconstruction accuracy?
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,786
Location
My kitchen or my listening room.
The question is more, does upsampling help to improve timing reconstruction accuracy?

To any human-perceptible level, absolutely not.

The time resolution of 16/44 is 1/(2*pi*20000*2^16) seconds. That is already absurdly below the 5-microsecond sensitivity demonstrated in the most sensitive observations made of the human auditory system.

Yeah, let me get out MATLAB and calculate that for us: 0.12 NANOseconds. Yeah, that'll do. That's for redbook CD, note. Going to 24 bits divides that by 256. Going to 40 kHz bandwidth at 96 kHz sampling divides it by another factor of 2.
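For anyone without MATLAB handy, the same arithmetic in a few lines of Python (same formula as quoted above):

```python
import math

dt = 1 / (2 * math.pi * 20_000 * 2**16)  # 16 bits, 20 kHz bandwidth
print(dt)          # ~1.21e-10 s, i.e. ~0.12 nanoseconds
print(dt / 256)    # 24 bits: 2^8 times smaller still
print(dt / 2)      # 40 kHz bandwidth: another factor of 2
```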

That is, of course, assuming that the SRC isn't using crappy filters. I've measured that a time or two (see the talk cited above on sampling rate conversion), but in general, other than the mistake of all half-band filters (see "regularity theorem"), most filters are at least half-decent today.

In house, we use one we wrote. It's done in double-precision, and a pass through it is better than 24 bits at all times. I don't worry much about that.
 
Last edited:

TabCam

Active Member
Joined
Feb 16, 2020
Messages
199
Likes
170
You seem to assume that our auditory system uses 20 kHz tones for localization. We use a mixed approach of Interaural Time Difference, Interaural Intensity Difference, group delays, and phase for the lower frequencies. I think that mixed approach accounts for the many different experiences of soundstage, localization, etc., not only with the same speaker but also with different electronics.

I think your calculation of 0.12 ns is way off, but let's say 1 microsecond: that is almost an order of magnitude better than necessary, so more than accurate enough. However, as phase, group delay, and intensity also play a role, maybe that can account for the many different experiences. That is in amplitude, so a property of the reconstruction filter. If you halve the frequency, we only have four samples per wave, but is that enough to also account for modulation of intensity and frequency? Would a better

In short, we could take high-accuracy recordings (>=176 kHz/24-bit or DSD512), downsample them to 44.1/16 or 48/16, and use more advanced upsampling algorithms to see if those reconstruction filters provide more accuracy, and look at the benefits and/or drawbacks, if any.
 

Sashoir

Active Member
Forum Donor
Joined
May 15, 2020
Messages
118
Likes
140
You'll go blind. And spectacles are inconvenient with masks.
 

Killingbeans

Major Contributor
Joined
Oct 23, 2018
Messages
4,096
Likes
7,570
Location
Bjerringbro, Denmark.
I think that mixed approach accounts for the many different experiences of soundstage, localization, etc., not only with the same speaker but also with different electronics.

I think a lot of things too. Most of the time it only takes a bit of research for me to figure out that I was just daydreaming.

Since I started emphasizing critical thinking a bit more a few years ago, I've been piecing together a personal picture of which effects and influences account for the vast majority of the different experiences reported with different audio gear. It's not a bulletproof picture, but one thing's for certain... ultrasonics are not a part of it.

If you halve the frequency, we only have four samples per wave, but is that enough to also account for modulation of intensity and frequency?

Can you give an example of that sort of modulation taking place?

and look at the benefits and/or drawbacks, if any.

How would they be defined, and how would you detect them?

I'm genuinely curious. Most of the things you wrote in that post make very little sense to me. But I'm not the sharpest tool in the shed, at least compared to a lot of the users in here, so there's a real risk that I'm missing something.

BTW, I think the advent of using neural networks for data processing is very interesting. I don't see the point of upsampling perfectly fine recordings, but I guess it could be used to clean up bad or damaged recordings. But then again, even the best educated guess is still just a guess, no matter how well you train the network beforehand. And like Amir says, there will be situations where it screws up badly, so real-time implementation could be problematic. It would make more sense to listen to some recordings from a good tribute band instead. (Sorry for going off topic.)
 
Last edited:

RayDunzl

Grand Contributor
Central Scrutinizer
Joined
Mar 9, 2016
Messages
13,246
Likes
17,159
Location
Riverview FL
Sound travels (if I mathed it correctly) 0.343 mm in a microsecond.

I don't think I could detect any sonic differences attributable to that level of precision.
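The math is right, assuming 343 m/s for the speed of sound in air:

```python
c = 343.0               # speed of sound in air, m/s
print(c * 1e-6 * 1e3)   # mm travelled in 1 microsecond -> 0.343
```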
 

NTK

Major Contributor
Forum Donor
Joined
Aug 11, 2019
Messages
2,707
Likes
5,973
Location
US East
In short, we could take high-accuracy recordings (>=176 kHz/24-bit or DSD512), downsample them to 44.1/16 or 48/16, and use more advanced upsampling algorithms to see if those reconstruction filters provide more accuracy, and look at the benefits and/or drawbacks, if any.
What are "more advanced upsampling algorithms"?

Here are the audio analogs of the first 4 image interpolation algorithms on the Wikipedia page you referenced in your post #22.

[Attached image "Interpolate.png": the four interpolations applied to the same digital samples]


As you can see, all 4 interpolated signals pass through the digital samples. Therefore, they are all "legitimate" signals that, when sampled, give the exact same digital samples. There are, in fact, an infinite number of other waveforms that do the same.

What the sampling theorem says is: if the original analog signal is band-limited to <0.5 fs, there is only one unique waveform that will give those digital samples. What the theorem does not explicitly say is that there are an infinite number of non-band-limited waveforms that will give those samples too (see footnote).

Out of the 4 interpolations shown, only the sinc-interpolated one is band-limited to 0.5 fs. Therefore, if the original signal is band-limited, the sinc function is the only (or most) accurate reconstruction.
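For anyone who wants to see sinc reconstruction at work, a minimal Whittaker-Shannon sketch (illustrative only; practical converters use windowed, finite-length sinc approximations, and the rate and tone here are made-up example values):

```python
import numpy as np

def sinc_interp(samples, fs, t_eval):
    """Evaluate the unique band-limited reconstruction at times t_eval
    by summing one shifted sinc per sample (np.sinc(x) = sin(pi x)/(pi x))."""
    n = np.arange(len(samples))
    return np.array([np.sum(samples * np.sinc(fs * t - n)) for t in t_eval])

fs = 8_000
n = np.arange(64)
x = np.sin(2 * np.pi * 1_000 * n / fs)            # 1 kHz tone sampled at 8 kHz

t_fine = np.arange(0, len(n) / fs, 1 / (8 * fs))  # 8x denser time grid
y = sinc_interp(x, fs, t_fine)
# Away from the ends (where the finite sample set truncates the sinc sum),
# y closely tracks sin(2*pi*1000*t_fine).
```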

If the original signal is not band-limited, reconstruction is underdetermined and there are an infinite number of solutions to the reconstruction problem. Now the question is: which of those infinitely many waveforms do you choose? Hence JJ's question of what the original is. How do you define accuracy, i.e., what is a more "accurate" reconstruction when you have to guess the missing information?

What looks good for images (in guessing what the missing content should be) does not work for audio. Guessing the missing data is a non-linear process; what works well for one subset of signals may fail miserably on other subsets. Obviously, anyone interested enough can try training a neural net to see how well it works. One of the first big problems will be finding a loss function that is perceptually relevant and also gives nice gradient properties for gradient descent to work.

Footnote:
Of course, that was the original question. Before the sampling theorem, people were unsure whether we could accurately reconstruct an analog signal from discrete samples at all. The sampling theorem answered this question (and gave the criteria that must be met).
 

j_j

Major Contributor
Audio Luminary
Technical Expert
Joined
Oct 10, 2017
Messages
2,279
Likes
4,786
Location
My kitchen or my listening room.
You seem to assume that our auditory system uses 20 kHz tones for localization.

No, I do not assume that. The bandwidth is what matters, not the frequency. If the frequency range were 1 Hz wide, between 19999 and 20000 Hz, you'd still only have 1/(2*pi*1) for that part of the denominator. Remember, no tone lasts forever, and therefore no tone is actually one single frequency. If you have a very narrow bandwidth, the "ringing" is going to confound your measurements at some point in the timing. Now, theoretically, it should settle out more or less completely, but you would be surprised to realize that it never quite goes away completely. It can't.

Your questions about level, intensity, etc., all fail to grasp the very basics of sampling theory. Please find yourself a good tutorial on that, and read it.

And, again, the answer to your question, in terms of in-band signals, is "no". There is a specific, precise mathematical definition, and it can be executed on a low-end computer in real time these days.
 
Last edited: