
Where is the AI audio upscaler?

ZolaIII

Major Contributor
Joined
Jul 28, 2019
Messages
4,069
Likes
2,409
@Sam Ash it comes down to extensive use of a tool chain (for example MPC Classic + shader effects, the madVR... system), same as with music, and again it's source dependent (no amount of post-processing will fix a bad one). Unfortunately we don't have cross-compatible ones (video falls back to the OS default sound output driver).
 

Sam Ash

Active Member
Joined
May 24, 2017
Messages
162
Likes
37
Some good points you've made there @ZolaIII. As a personal preference, I listen to all my music via Dolby Pro Logic IIx and like the resulting sound stage; I prefer it to DSU. Native spatial tracks are even better and very enjoyable. I think one benefit DSU provides is that it is designed to make use of all the drivers in an Atmos configuration.
 

EdTice

Senior Member
Joined
Aug 18, 2020
Messages
353
Likes
175
Some good points you've made there @ZolaIII. As a personal preference, I listen to all my music via Dolby Pro Logic IIx and like the resulting sound stage; I prefer it to DSU. Native spatial tracks are even better and very enjoyable. I think one benefit DSU provides is that it is designed to make use of all the drivers in an Atmos configuration.
I'm a fan of PLIIx as well for music. It does a great job of keeping the vocals (at least mostly) coming from where one would expect with the rears carrying mostly instruments in a way that makes the music more immersive. I don't like Neo because it tends to make the vocals less clear. PLIIx is not "AI" but in my mind that's better since it's predictable and effective.
 

ZolaIII

Major Contributor
Joined
Jul 28, 2019
Messages
4,069
Likes
2,409
@Sam Ash for that I use EBU R128 loudness normalization. I neither listen in surround nor have used Dolby for a very long time (there were days when I did, with an SB).
 
OP
anphex

Addicted to Fun and Learning
Forum Donor
Joined
May 14, 2021
Messages
662
Likes
870
Location
Berlin, Germany
Hi everyone.

thanosmeme.jpg


For a while after my post I kept thinking to myself: why not just make this myself?
I'm no beginner with the Python programming language and, thankfully, these days there are some sophisticated and accessible machine learning frameworks around. Note that even if the following might sound scientific, it was just me fiddling around with code I barely understand, and everything I tried was based on hunches and lots of googling.

Spoiler for those who want me to get to the point: no, I haven't achieved really credible upscaling so far, but the current approaches are promising.

The main goal is not just to upscale the bit depth and sampling rate but also to restore the detail that lossy compression usually destroys. Turns out it's pretty hard if you want good, hi-fi-like results.

So far, I tried a plain recurrent neural network (RNN), a convolutional neural network (CNN) and a generative adversarial network (GAN).

Parameters for all networks you're going to see:
Dataset: 19 songs bought on Qobuz in 192/24, mixed genres; only the left channel was used for the sake of simplicity. I might switch to stereo once I get good results on mono.
Training steps: 1000. Usually more, but for the sake of comparison I'll stop after 1000 for the following examples.
Sample batches per step: 16 (16 × 1000 × 2048 = 32,768,000 bytes processed)
Samples to analyze at once: 2048
Input data: 128 kbps MP3 converted from the 192/24 source data with ffmpeg
Target data: lossless 24-bit 176.4 kHz FLAC
Training time: between 5 and 15 minutes, depending on network complexity.


192 kHz wasn't possible since it's not an integer multiple of the MP3's 44.1 kHz sample rate (176.4 kHz = 4 × 44.1 kHz is, though). Neural networks hate that. There is probably a way to get to that number, but I couldn't find one that didn't involve very complicated number crunching. This is a problem because there will inevitably be rounding offset errors that could make training more difficult. A little offset of a few samples isn't bad and may even make the model more robust to different inputs, but at some point it will just confuse the model if the offset gets too big.
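For reference, an exact integer-ratio upsample is easy with scipy (a sketch, not my exact notebook code; it assumes the decoded MP3 is already a mono float numpy array at 44.1 kHz):

```python
import numpy as np
from scipy.signal import resample_poly

# 176.4 kHz is exactly 4 x 44.1 kHz, so the polyphase resampler works with a
# pure integer ratio and no cumulative rounding offset creeps in.
def upsample_to_target_rate(mp3_audio: np.ndarray) -> np.ndarray:
    return resample_poly(mp3_audio, up=4, down=1)
```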


The easiest one is an RNN: a network with one or more hidden layers where all nodes are connected to the nodes of the previous and the next layer. It's rather "stupid", since it's more for data extraction and simple classification and doesn't learn any patterns or features. As you might guess, the results were unexciting and nothing a resampler couldn't have done better.
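The general shape of that first network is roughly this (a Keras sketch with made-up layer widths, not my exact notebook code):

```python
import tensorflow as tf

SAMPLES = 2048  # block length from the parameters above

# Plain fully connected baseline: every node is connected to every node of the
# neighbouring layers. Widths and activations here are placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SAMPLES,)),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(SAMPLES, activation="tanh"),  # one output sample per input sample
])
model.compile(optimizer="adam", loss="mse")
```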

Results:
2022-03-23---17_09_40.jpg

The network basically does nothing to change the basic pattern of the input data; it just tries to push the signal closer to the FLAC data. Therefore, it is not fit for this purpose.



The next one, and so far the most successful, was the convolutional neural network. Here we use hidden layers with some math magic that searches for patterns in the data and then feeds those into an upscaling deconvolution layer, which applies its own patterns to the patterns it was given to create new content.
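In Keras terms the rough idea looks like this (again a sketch; filter counts and kernel sizes are made up, not what I actually ran):

```python
import tensorflow as tf

SAMPLES = 2048
UP = 4  # 44.1 kHz -> 176.4 kHz

# Conv1D layers look for local patterns; Conv1DTranspose ("deconvolution")
# upsamples by the integer factor and paints new samples from those patterns.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SAMPLES, 1)),
    tf.keras.layers.Conv1D(64, 9, padding="same", activation="relu"),
    tf.keras.layers.Conv1D(64, 9, padding="same", activation="relu"),
    tf.keras.layers.Conv1DTranspose(32, 9, strides=UP, padding="same", activation="relu"),
    tf.keras.layers.Conv1D(1, 9, padding="same", activation="tanh"),
])
model.compile(optimizer="adam", loss="mse")
```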

Results:
2022-03-23---17_59_30.jpg


Now we're getting much more interesting output! Look how it smooths the edges and tries to get back to the original curve. Still, it looks like it's just kind of smoothing the MP3 - maybe even better than what a resampler would produce - but not enough to call it a restoration, since the details are still missing. Also note the phase shift. It's only about 5 samples and only looks bad because it's extremely zoomed in (we're looking at about 0.0004 seconds here), but it still bothers me. This is why I am thinking of splitting the frequencies.
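What I have in mind for the frequency splitting is roughly this (a scipy sketch; the crossover frequencies and filter order are just placeholders):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 44_100  # sample rate of the decoded MP3

def split_bands(x: np.ndarray, crossovers=(200, 2_000, 8_000)):
    """Split a mono signal into bands so each band can get its own small model.
    Zero-phase filtering (sosfiltfilt) avoids adding yet more phase shift."""
    edges = [0, *crossovers, FS / 2]
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0:
            sos = butter(4, hi, btype="lowpass", fs=FS, output="sos")
        elif hi >= FS / 2:
            sos = butter(4, lo, btype="highpass", fs=FS, output="sos")
        else:
            sos = butter(4, [lo, hi], btype="bandpass", fs=FS, output="sos")
        bands.append(sosfiltfilt(sos, x))
    return bands  # summing the bands gives back (approximately) x
```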



We now have a network that produces a reasonable reconstruction but without "dreaming up" the fine detail. For that, the next neural network comes into play.
The third one, the weirdest but most exciting so far, is the generative adversarial network. Simply put, it's crazy and genius at once. We have two networks. A generator, which creates patterns from pure random noise input (yes, really) and tries to trick the other network: the discriminator. The discriminator tries to guess whether the output from the generator is "valid" output or not. This produces a loss, a measure of error, which is sent back to the generator. So basically, the generator makes stuff up from random input data until it can trick the discriminator. This is the core of many AI gadgets you see today, like deep fakes, aging simulators, etc.
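The skeleton of a GAN training loop looks something like this (a heavily simplified sketch with toy dense networks; layer sizes and architecture are placeholders, not what I actually ran):

```python
import tensorflow as tf

NOISE_DIM, SAMPLES = 100, 2048

generator = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NOISE_DIM,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(SAMPLES, activation="tanh"),   # fake audio block
])
discriminator = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SAMPLES,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(1),                            # real/fake logit
])
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt, d_opt = tf.keras.optimizers.Adam(1e-4), tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_blocks):
    noise = tf.random.normal([tf.shape(real_blocks)[0], NOISE_DIM])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator(noise, training=True)
        real_logits = discriminator(real_blocks, training=True)
        fake_logits = discriminator(fake, training=True)
        # Discriminator: call real blocks real and generated blocks fake.
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Generator: fool the discriminator into calling fakes real.
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
```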

Unfortunately, adjusting the parameters is really hard for this one, so I didn't manage to produce any reasonable results apart from "hm, yes, it's trying to make something." I am still looking into this in my spare time and I am confident that I will come up with something interesting after researching for a while. It's much more complicated than the two networks above, since much more can go wrong. Most of my attempts look like this:
2022-03-23---18_45_32asd.jpg

I'll update once I get it to work!



Current main issues
- phase difference during training that may reduce the quality of the restored signal
- adjusting the training hyperparameters to produce at least a reasonable AI output curve when using GAN


Next approaches I am looking into:
- create separate, smaller models for dedicated frequency ranges to reduce the issue of phase offset
- try out a Long Short-Term Memory (LSTM) network and see how it handles the time domain issues
- get GAN to work :C

Thank you for reading. I'll post again.


If anyone with more expertise is interested in tackling this, I can share my Jupyter notebooks or use Colab.

Edit: If anyone is interested in trying out my best-working upscaler so far (second graph image), just send me the file and I'll throw it to the robots :)
 

Katji

Major Contributor
Joined
Sep 26, 2017
Messages
2,990
Likes
2,273
Wow, I was thinking about this a few days ago: it's been 30 years since my friend was working on his master's thesis on neural networks and we were talking about it every day. 30 years. Surely it has developed since then. I should... google.
The main thing we talked about was OO, C++ vs. Objective-C.
Then life took different turns. :rolleyes: :oops: :facepalm: :)
 
OP
anphex

Addicted to Fun and Learning
Forum Donor
Joined
May 14, 2021
Messages
662
Likes
870
Location
Berlin, Germany
Another idea that came to me just now is adding about 1-5% white noise to the MP3 to give the neural network some "fake detail" that it could remove or reshape depending on the detail in the target signal. This, paired with frequency splitting and phase aligning, could yield very interesting results.
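In code the idea is just this (a sketch; the noise level relative to the signal's RMS is what I'd play with):

```python
import numpy as np

def add_fake_detail(mp3_audio: np.ndarray, amount: float = 0.02) -> np.ndarray:
    """Mix in roughly 1-5 % white noise (relative to the signal's RMS) so the
    network has some high-frequency material it can reshape or remove."""
    rms = np.sqrt(np.mean(mp3_audio ** 2)) + 1e-12
    noise = np.random.default_rng().normal(0.0, rms * amount, size=mp3_audio.shape)
    return mp3_audio + noise
```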

I will try this out later today! Also I am thinking of making a dedicated thread for the design of this upscaler to make it more visible as a community/open source project.
 

voodooless

Grand Contributor
Forum Donor
Joined
Jun 16, 2020
Messages
10,223
Likes
17,799
Location
Netherlands
Nice try! Fun to see this getting some action.

A few things: usually you would divide your dataset into a training set, and a test set. Did you do this?

You can use both channels and extend your dataset by a factor of two.

The graphs you show are just waveforms? If so, don't connect the samples with lines; those are not correct waveform representations.

I would not worry too much about the sample offset as long as it's a fixed one. Just take it into account when evaluating.

Probably better to use something like DeltaWave to compare the files. Or define some KPIs to evaluate performance.

When doing blockwise processing note that the start and end data might not be very well processed. One could use overlapping windows and only use the middle part.

What kind of cost function do you use?
 
OP
anphex

Addicted to Fun and Learning
Forum Donor
Joined
May 14, 2021
Messages
662
Likes
870
Location
Berlin, Germany
Nice try! Fun to see this getting some action.

A few things: usually you would divide your dataset into a training set, and a test set. Did you do this?
Yes! I have a train, eval and test dataset with a 98:1:1 ratio. It's not the best ratio, but it makes quick tests and plotting way more comfortable.
You can use both channels and extend your dataset by a factor of two.
I did this, but the phase offset increases with the length of the dataset due to the inevitable rounding errors of resampling. Once I get the smaller dataset to phase-align, I'll feed it all the data I've got.
The graphs you show are just waveforms? If so, don't connect the samples with lines; those are not correct waveform representations.
What do you mean exactly? Changing the plot style in matplotlib? Of course I can do that if you point me to the recommended settings.
I would not worry too much about the sample offset as long as it's a fixed one. Just take it into account when evaluating.
No, unfortunately it's shifting over the dataset.
Probably better to use something like DeltaWave to compare the files. Or define some KPIs to evaluate performance.
Will look into this!
When doing blockwise processing note that the start and end data might not be very well processed. One could use overlapping windows and only use the middle part.
I actually noticed this when listening to real-life data. When testing on random 128 kbps MP3s everything turns out pretty well so far, but once I process a file with a different bitrate than 128k you can hear audible pops once in a while. Not sure how to handle varying input bitrates or even VBR right now.
What kind of cost function do you use?
Mean squared error (so that large errors are "punished" more severely) and, as the optimizer, mostly Adam with default settings. I've got the best results so far with that combination.
 
OP
anphex

Addicted to Fun and Learning
Forum Donor
Joined
May 14, 2021
Messages
662
Likes
870
Location
Berlin, Germany
Hello guys! After hours of trying different models (GAN, RNN, CNN, DNN) I came to a conclusion and an answer as to why no one has really done this.
It's incredibly difficult for sequential data!

Processing and upscaling images isn't really a problem, since every image you process is closed in on itself and has no relation to other pictures. Music, or audio in general, however, is like one continuous line, data stream, signal. Now the issue with audio is that if you process individual segments of a sound file, each segment doesn't know about the other segments, and thus the issue of popping and clicking arises, since the first and last values of individual segments rarely match. You could fix this with some smoothing after processing, but it's just an ugly solution.

Also, I tried to do this, but I couldn't find any obvious jumps in the audio data to smooth, which begs the question whether those click sounds come from somewhere else.

I think the best way would be to feed the whole file - not just segments, really the whole file - into a neural network. But this requires insane amounts of processing power and RAM. It's like trying to feed a squirrel a whole ice machine instead of a spoon.

I tried it. No matter how "small" I tried to make the song, once you try to send 3 minutes of audio through a network it becomes the data equivalent of several music albums. Even my RTX 3090 raised the white flag.

So either you rent a supercomputer, or at least an extremely expensive cloud AI training instance, or we just have to wait until someone comes up with a very smart paper on this. Looking at recent AI progress, this shouldn't take long.
 

-Matt-

Addicted to Fun and Learning
Joined
Nov 21, 2021
Messages
675
Likes
551
Interesting stuff!

This part makes me worry...
Not sure how to handle varying input bitrates or even VBR right now.
...and the problems with phase alignment.
No, unfortunately it's shifting over the dataset.
Are you certain that the ffmpeg conversion has worked as expected? The original and the pre-AI MP3 differ by more than I'd expect. If this training data is messed up it definitely won't work.

E.g. For the two waveforms to line up temporally you might need to repeat sample values when plotting the lower res file? Or allow non-uniform temporal spacing?

The graphs you show are just waveforms? If so, don't connect the samples with lines; those are not correct waveform representations.
I think voodooless is suggesting plotting just dots for each sample value - not joining them with a line. (linestyle="None", marker=".")
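Something like this (a sketch; flac_block and mp3_block stand in for whatever arrays are being plotted):

```python
import matplotlib.pyplot as plt

# Plot raw sample values as isolated markers rather than joining them with lines;
# the line segments imply a waveform shape the samples alone don't define.
plt.plot(flac_block, linestyle="None", marker=".", label="flac (176.4 kHz)")
plt.plot(mp3_block, linestyle="None", marker="x", label="mp3 (upsampled)")
plt.legend()
plt.show()
```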

Processing and upscaling images isn't really a problem since every image you process is closed in on itself and has no relation to other pictures.
I think some video processing does attempt to track objects as they move from frame to frame. I.e. It also has a temporal buffer. This can cause visual artifacts which are probably the equivalent of the audio clicks that you are hearing.

When doing blockwise processing note that the start and end data might not be very well processed. One could use overlapping windows and only use the middle part.
Yes, this can be a way to mitigate the clicks at the start and end of processing blocks.
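One way to realize it is a windowed overlap-add rather than literally keeping only the middle part (a sketch; process_block stands in for the trained model):

```python
import numpy as np

def process_with_overlap(x, process_block, block=2048, hop=1024):
    """Run the model on overlapping blocks and crossfade them with a Hann
    window, so the poorly reconstructed block edges never appear on their own."""
    window = np.hanning(block)
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for start in range(0, len(x) - block + 1, hop):
        y = process_block(x[start:start + block]) * window
        out[start:start + block] += y
        norm[start:start + block] += window
    return out / np.maximum(norm, 1e-12)
```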
 

-Matt-

Addicted to Fun and Learning
Joined
Nov 21, 2021
Messages
675
Likes
551
How about this quasi-continuous scheme:

Let's say you are trying to upscale from 48kHz to 192kHz (4x)...

For rollout, have two inputs (which could be concatenated).

Input 1 is a buffer containing the last 1000 upscaled samples (initially will just be zeros or maybe sub-audible noise).

Input 2 is the next 6 low res samples.

The NN has to try to predict the next 24 upscaled samples (but it already has correct values for 6 of these from input 2 so it just needs to fill in the gaps, it also knows the audio style from input 1). Crucially the last predicted sample will always be one of the ones for which the correct value is provided by the low res file (via input 2), hence large jumps and discontinuities should be avoided.

The 24 predicted samples (at 192kHz sample rate) are slid into the input 1 buffer, dropping old samples from the other end (for playback).

Keep repeating prediction of the next 24 samples until the end of file is reached.

For training, losses can be calculated directly from the difference between the original 192kHz file and the NN predicted samples, as you probably already do, but you need to be careful that the comparison is always between samples at the same equivalent time/phase.

Change array sizes to more appropriate values if needed.

Decide whether or not to allow prediction for the samples already given by input 2 (might be desirable if you are also increasing bit depth). Decide whether or not these samples should be excluded from the loss calc.
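In rough Python the rollout loop would look something like this (predict stands in for the trained network; array sizes as above):

```python
import numpy as np

CONTEXT = 1000       # input 1: last upscaled samples
LOW_RES = 6          # input 2: next low res samples
UP = 4               # 48 kHz -> 192 kHz
PRED = LOW_RES * UP  # 24 upscaled samples predicted per step

def rollout(low_res, predict):
    context = np.zeros(CONTEXT)                  # or sub-audible noise
    out = []
    for i in range(0, len(low_res) - LOW_RES + 1, LOW_RES):
        chunk = low_res[i:i + LOW_RES]           # the 6 known low res samples
        new = predict(context, chunk)            # NN fills in the gaps, shape (PRED,)
        out.append(new)
        context = np.concatenate([context, new])[-CONTEXT:]  # slide the buffer
    return np.concatenate(out)
```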
 

thewas

Master Contributor
Forum Donor
Joined
Jan 15, 2020
Messages
6,741
Likes
16,174
I remember even my 2001 Technics SL-PS7 CD player had such a switchable gimmick, which their current products still use; they call it "Re-Master Processing". Here is a marketing photo of it:

ast-1242642.png.pub.thumb.644.644.png


Of course I didn't hear any difference.
 

voodooless

Grand Contributor
Forum Donor
Joined
Jun 16, 2020
Messages
10,223
Likes
17,799
Location
Netherlands
I remember even my 2001 Technics SL-PS7 CD player had such a switchable gimmick, which their current products still use; they call it "Re-Master Processing". Here is a marketing photo of it:

ast-1242642.png.pub.thumb.644.644.png


Of course I didn't hear any difference.
Staircase alert! I'd also be worried about that "frequency expansion". That's a leaky filter.
 

-Matt-

Addicted to Fun and Learning
Joined
Nov 21, 2021
Messages
675
Likes
551
This is an example of what I get when I use ffmpeg to convert from a 192 kHz, 24-bit FLAC to 16-bit MP3 files with various sampling rates and bit rates. (No phase offset.)
Figure_1.png

Using for example:
ffmpeg -i input.flac -ab 128k -ar 22050 -sample_fmt s16p output.mp3
 

-Matt-

Addicted to Fun and Learning
Joined
Nov 21, 2021
Messages
675
Likes
551
Edit... the lower res mp3 files do seem to struggle with some peaks. (320kbps, 44.1kHz does much better).
Figure_2.png

(Updated image so that horizontal axis is exactly 1ms long).
 

nerdstrike

Active Member
Joined
Mar 1, 2021
Messages
257
Likes
309
Location
Cambs, UK
I'm not sure of the utility, but as Matt put it,
How about this quasi-continuous scheme:
A sliding window ought to give you enough neighbourhood to inform on what is right or wrong with any specific sample.

Presumably that window needs to be wide enough to capture the considered inputs for a compressor if you are to invent the correct detail that was removed. Temporal anti-aliasing works in a similar fashion by considering past frames.

Heck, you can run stripes of the whole track all at once in different threads and it'll fly.

Ultimately my concern is what you are training it to recognise: the foibles of particular compression algorithms plus their settings. It could be very unstable if overtrained, and I don't know if you can tell in retrospect how some music was compressed.

The hardest part in any "data science" project is getting your source material right, i.e. a fair sample of possible music signals, with a hold-out set not used in training for assessing predictive power.
 
OP
anphex

Addicted to Fun and Learning
Forum Donor
Joined
May 14, 2021
Messages
662
Likes
870
Location
Berlin, Germany
Hey guys, I didn't post here for a while since I didn't achieve anything groundbreaking. But there was progress. The main problem all along was that I was using methods that are usually meant for image processing and not for sequential data.

After digging some more I found the very promising LSTM layers, which I am still fiddling around with. There is a magic word: stateful RNN. This guarantees that there is no clicking and popping between the processed blocks. That's already really big progress.
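The stateful part boils down to something like this in Keras (a sketch; the layer width is a placeholder, not my actual configuration):

```python
import tensorflow as tf

BATCH, STEPS = 16, 2048

# stateful=True carries the LSTM cell state from one batch to the next, so
# consecutive blocks of the same song stay continuous instead of clicking at the seams.
model = tf.keras.Sequential([
    tf.keras.layers.Input(batch_size=BATCH, shape=(STEPS, 1)),
    tf.keras.layers.LSTM(128, stateful=True, return_sequences=True),
    tf.keras.layers.Dense(1, activation="tanh"),
])
model.compile(optimizer="adam", loss="mse")
# Train with shuffle=False and call reset_states() on the LSTM layer between songs,
# so state doesn't leak across files.
```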

Next up is my training data. I just generated uniform noise with numpy and converted it to MP3. That's probably the easiest 5 hours of training data I've ever made, haha. And no phase alignment issues or sample offsets!
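Generating that data is only a few lines (a sketch; file names and duration are arbitrary, and ffmpeg is forced to 44.1 kHz because MP3 can't store 176.4 kHz):

```python
import subprocess
import numpy as np
from scipy.io import wavfile

FS = 176_400                         # high-res "target" rate
rng = np.random.default_rng(0)

noise = rng.uniform(-0.5, 0.5, size=FS * 60).astype(np.float32)  # 60 s of uniform noise
wavfile.write("target.wav", FS, noise)                           # lossless target

# Lossy counterpart: 128 kbps MP3 at 44.1 kHz, i.e. exactly 1/4 of the target rate.
subprocess.run(["ffmpeg", "-y", "-i", "target.wav", "-ar", "44100",
                "-b:a", "128k", "input.mp3"], check=True)
```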

Currently I am trying to create a mix of deconvolution (aka upscaling) and LSTM.
But more importantly, in the process I came to realize just how efficient and good MP3 is from a compression perspective. What I am doing is mostly just tinkering around rather than achieving a great innovation, but it's extremely fun.
 

KikoKentaurus

Member
Joined
Apr 8, 2020
Messages
80
Likes
66
Location
Klaipeda, LTU
Look up Spectral Band Replication. It extends the spectrum of lossy codecs which have a bandwidth lower than 22.4 kHz. It sounds impressive but then gets annoying to my ears.
Any guides/instructions on how to use it? F.e. where to get software, manuals, etc.
 

KikoKentaurus

Member
Joined
Apr 8, 2020
Messages
80
Likes
66
Location
Klaipeda, LTU
Hey guys, it's been a long year!

Here's something for you to check:
https://audioldm.github.io/audiosr/
It's very recent - the academic paper was published this September, so it's 2 months old.

P.S. - https://replicate.com/nateraw/audio-super-resolution?input=form&output=preview - a usable model
 