Hi everyone.
For a while after my post, I thought to myself: why not just make this myself?
I am no beginner with the Python programming language and, thankfully, these days there are some sophisticated and accessible machine learning frameworks around. Note that even if the following might sound scientific, it was just me fiddling around with some code I barely understand, and everything I tried was based on hunches and lots of googling.
Spoiler for those who want me to get to the point: no, I haven't achieved truly credible upscaling so far, but the current approaches are promising.
The main goal is not to just upscale the bit depth and sampling rate but also to restore the detail that lossy compression usually destroys. Turns out, it's pretty hard if you want good, hifi-like results.
So far, I tried a plain recurrent neural network (RNN), a convolutional neural network (CNN) and a generative adversarial network (GAN).
Parameters for all networks you're going to see:
Dataset: 19 songs of mixed genres bought on Qobuz in 192/24; only the left channel was used for the sake of simplicity. I might switch to stereo once I get good results on mono.
Training steps: 1000. Usually I run more, but for the sake of comparison I'll stop after 1000 for the following examples.
Sample batches per step: 16 (16 × 1000 × 2048 = 32,768,000 samples processed per run)
Samples to analyze at once: 2048
Input data: 128 kbps mp3 converted from 192/24 source data with ffmpeg
Target data: lossless 24 bit 176.4 kHz flac
Training time: depending on network complexity between 5 and 15 minutes.
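The batching behind those numbers looks roughly like this (a minimal numpy sketch with made-up helper names like `draw_batch`, not my actual training code; for simplicity it assumes both tracks are already decoded to aligned arrays at the same rate, whereas in the real setup the flac window is 4× longer than the mp3 window):

```python
import numpy as np

WINDOW = 2048      # samples analyzed at once
BATCH = 16         # sample batches per step
STEPS = 1000       # training steps

def draw_batch(mp3, flac, rng):
    """Cut BATCH random aligned windows of WINDOW samples from a track pair."""
    starts = rng.integers(0, len(mp3) - WINDOW, size=BATCH)
    x = np.stack([mp3[s:s + WINDOW] for s in starts])
    y = np.stack([flac[s:s + WINDOW] for s in starts])
    return x, y

# toy stand-ins for one decoded mono track pair
rng = np.random.default_rng(0)
mp3 = rng.standard_normal(100_000).astype(np.float32)
flac = rng.standard_normal(100_000).astype(np.float32)

x, y = draw_batch(mp3, flac, rng)
print(x.shape, y.shape)            # (16, 2048) for both
print(BATCH * STEPS * WINDOW)      # 32768000 samples over a full run
```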
192 kHz wasn't possible since it's not an integer multiple of the mp3 input's sample rate, and neural networks hate that. There is probably a way to get to that number, but I couldn't find one that didn't involve very complicated number crunching. This is a problem because there will inevitably be rounding offset errors that could make training more difficult. A small offset of a few samples isn't bad and may even make the model more robust to different inputs, but at some point it may just confuse the model if the offset gets too big.
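The integer-multiple constraint is easy to check with stdlib fractions (assuming the 128 kbps mp3 decodes at the usual 44.1 kHz):

```python
from fractions import Fraction

mp3_rate = 44_100   # assumed decode rate of the mp3 input

for target in (176_400, 192_000):
    ratio = Fraction(target, mp3_rate)
    print(target, ratio, ratio.denominator == 1)
# 176400 Hz is exactly 4x the input rate, while 192000 Hz reduces to
# 640/147 -- an awkward fractional ratio for a fixed upscale factor
```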
The easiest one is the RNN, a network with one or more hidden layers where every node is connected to all nodes of the previous and the next layer. It's rather "stupid" since it's more suited to data extraction and simple classification and doesn't learn any patterns or features. As you might guess, results were unexciting and nothing a resampler couldn't have done better.
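A forward pass through such a fully connected net boils down to a couple of matrix multiplications (a minimal numpy sketch with made-up layer sizes and random untrained weights, not my actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# one hidden layer: 2048 samples in -> 512 hidden nodes -> 2048 samples out
W1 = rng.standard_normal((2048, 512)) * 0.01
b1 = np.zeros(512)
W2 = rng.standard_normal((512, 2048)) * 0.01
b2 = np.zeros(2048)

def forward(x):
    """Dense net: every input sample feeds every hidden node and vice versa."""
    h = np.tanh(x @ W1 + b1)   # hidden layer with nonlinearity
    return h @ W2 + b2         # linear output layer

batch = rng.standard_normal((16, 2048))
print(forward(batch).shape)    # (16, 2048)
```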
Results:
The network does basically nothing to change the basic pattern of the input data; it just tries to push the signal closer to the flac target data. Therefore, it is not fit for this purpose.
The next one, and so far the most successful, was the convolutional neural network. Here we use hidden layers with some math magic that searches for patterns in the data, then feed those to an upscaling deconvolution layer that applies its own patterns to the already found patterns to create new output.
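The conv-then-deconv idea can be sketched in a few lines of numpy (a toy with one hand-picked filter per stage; the real model stacks many filters with learned weights):

```python
import numpy as np

def conv1d(x, kernel):
    """'Same'-padded 1-D convolution: slide a pattern detector over the signal."""
    pad = len(kernel) // 2
    return np.convolve(np.pad(x, pad), kernel, mode="valid")[:len(x)]

def deconv1d(x, kernel, stride=4):
    """Transposed convolution: each input sample 'stamps' the kernel into a
    stride-upsampled output -- this is the upscaling step."""
    out = np.zeros(len(x) * stride + len(kernel) - 1)
    for i, v in enumerate(x):
        out[i * stride:i * stride + len(kernel)] += v * kernel
    return out[:len(x) * stride]

x = np.sin(np.linspace(0, 20, 2048))            # toy input at the low rate
feat = conv1d(x, np.array([0.25, 0.5, 0.25]))   # feature extraction
up = deconv1d(feat, np.ones(4) / 4, stride=4)   # 4x upsampled output
print(len(x), len(up))                          # 2048 -> 8192
```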
Results:
Now we're getting much more interesting output! Look how it smooths the edges and tries to get back to the original curve. Still, it looks like it's just kind of smoothing the mp3 (maybe even better than what a resampler would produce), but not enough to call it a restoration, since the details are still missing. Also note the phase shift. It's only about 5 samples and only looks bad because it's extremely zoomed in (we're looking at about 0.0004 seconds here), but it still bothers me. This is why I am thinking of splitting the frequencies.
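To put a number on that phase shift, the lag between output and target can be estimated by cross-correlation. A sketch on a synthetic 5-sample offset (a test pulse, not my real network output):

```python
import numpy as np

def estimate_lag(a, b):
    """Return the shift (in samples) of a relative to b via cross-correlation."""
    corr = np.correlate(a, b, mode="full")
    return int(np.argmax(corr)) - (len(b) - 1)

t = np.arange(2048)
target = np.exp(-0.5 * ((t - 1000) / 20.0) ** 2)  # clean reference pulse
shifted = np.roll(target, 5)                      # fake a 5-sample phase offset
print(estimate_lag(shifted, target))              # 5
```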
We made a network that produces a reasonable reconstruction, but without "dreaming up" the fine detail. For that, the next neural network comes into play.
The third one, which is the weirdest but most exciting so far, is the generative adversarial network. Simply put, it's nothing but crazy and genius at once. We have two networks. A generator creates patterns from pure random noise input (yes, really) and tries to trick the other network: the discriminator. The discriminator tries to guess whether the output from the generator is "valid" output or not. This produces a loss, a measurement of error, which is sent back to the generator. So basically, the generator makes stuff up from random input data until it can trick the discriminator. This is the core of many AI gadgets you see today, like deepfakes, aging simulators, etc.
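The adversarial game in a nutshell (a tiny numpy sketch of one loss evaluation with random untrained weights and made-up sizes; real training would backpropagate these losses through both networks):

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE, OUT = 64, 2048

Wg = rng.standard_normal((NOISE, OUT)) * 0.01   # generator weights
Wd = rng.standard_normal((OUT, 1)) * 0.01       # discriminator weights

def generator(z):
    """Map pure random noise to a fake audio window."""
    return np.tanh(z @ Wg)

def discriminator(x):
    """Output a probability that x is a real (flac) window."""
    return 1 / (1 + np.exp(-(x @ Wd)))

real = rng.standard_normal((16, OUT))           # stand-in for flac windows
fake = generator(rng.standard_normal((16, NOISE)))

# discriminator wants real -> 1 and fake -> 0; generator wants fake -> 1
d_loss = -np.mean(np.log(discriminator(real)) + np.log(1 - discriminator(fake)))
g_loss = -np.mean(np.log(discriminator(fake)))
print(d_loss, g_loss)   # both finite and positive before any training
```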
Unfortunately, adjusting the parameters is really hard for this one, so I didn't manage to produce any reasonable results apart from "Hm, yes, it's trying to make something." I am still looking into it in my spare time and I am confident that I will come up with something interesting after researching for a while. It's much more complicated than the two networks above since much more can go wrong. Most of my attempts look like this:
I'll update if I got it to work!
Current main issues
- phase difference during training that may reduce the quality of the restored signal
- adjusting the training hyperparameters to produce at least a reasonable AI output curve when using GAN
Next approaches I am looking into:
- create separate, smaller models for dedicated frequency ranges to reduce the issue of phase offset
- try out Long Short Term Memory Network and see how it handles the time domain issues
- get GAN to work :C
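For the frequency-splitting idea, a crude FFT-based two-band split (hypothetical cutoff value; a serious version would use a proper filter bank) shows the bands can be processed separately and summed back exactly:

```python
import numpy as np

def split_bands(x, rate, cutoff):
    """Split x into a low and a high band around cutoff Hz via FFT masking."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1 / rate)
    low = np.fft.irfft(np.where(freqs < cutoff, spec, 0), n=len(x))
    high = np.fft.irfft(np.where(freqs >= cutoff, spec, 0), n=len(x))
    return low, high

rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
low, high = split_bands(x, rate=176_400, cutoff=4_000)
print(np.allclose(low + high, x))   # True: the bands sum back to the original
```

Each band could then get its own smaller model, which should also shrink the phase-offset problem to within one band.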
Thank you for reading. I'll post again.
If anyone with more expertise is interested in tackling this, I can share my Jupyter Notebooks or use Colab.
Edit: If anyone is interested in trying out my best working upscaler so far (second graph image) just send me the file and I'll throw it to the robots