
Machine learning and digital audio

Tokyo_John

Active Member
Forum Donor
Joined
Mar 6, 2021
Messages
214
Likes
289
In recent years, machine learning (e.g., training neural networks) has become common in many domains. I use it in my own research: I have a graduate student working on a project to determine the travel times of seismic waves passing through Earth's interior, in order to expand our dataset beyond what is possible with hand-picked data (though the network is trained on hand-picked data). These data will be used for imaging our planet's interior. He also wants to use this approach to search for unusual seismic events in big data sets, which is impractical without computationally tractable methods. I have also watched with interest as digital image processing has embraced machine learning, with applications ranging from "super-resolution" to de-noising filters. A company called Topaz AI has taken this up, and its products are growing in popularity among photographers.

So I was naturally curious to see whether these approaches have been employed in digital audio... and of course they have. Speech recognition in digital recordings is a huge application and has developed into an entire industry. But what about hi-fi audio? How about common applications such as upsampling? It seems like neural networks could be an efficient way to perform these and other routine audio processing tasks (certainly more efficient than the M Scaler).

I found some interesting research on upsampling, along with some code and examples:
https://kuleshov.github.io/audio-super-res/

Very interesting work; here is a summary of the findings:

Machine learning algorithms are only as good as their training data. If you want to apply our method to your personal recordings, you will most likely need to collect additional labeled examples.

Interestingly, super-resolution works better on aliased input (no low-pass filter). This is not reflected well in objective benchmarks, but is noticeable when listening to the samples. For applications like compression (where you control the low-res signal), this may be important.

More generally, the model is very sensitive to how low resolution samples are generated. Even using a different low-pass filter (Butterworth, Chebyshev) at test time will reduce performance.
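That last point is easy to demonstrate. Here is a quick sketch (my own, not from the linked work) showing that the "same" low-resolution signal depends measurably on which anti-aliasing filter produced it; the filter types, orders, and cutoffs below are arbitrary choices for illustration:

```python
# Sketch: the "low resolution" version of a signal differs depending on
# which anti-aliasing filter produced it -- the train/test mismatch the
# authors describe. All filter choices here are illustrative.
import numpy as np
from scipy import signal

fs = 16000                        # original sample rate (Hz)
t = np.arange(fs) / fs            # 1 second of audio
x = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 3500 * t)

def make_low_res(x, ftype):
    """Low-pass to half the band, then decimate by 2."""
    if ftype == "butter":
        sos = signal.butter(8, 0.5, output="sos")
    else:
        sos = signal.cheby1(8, 1, 0.5, output="sos")  # 1 dB ripple
    return signal.sosfiltfilt(sos, x)[::2]

lo_butter = make_low_res(x, "butter")
lo_cheby = make_low_res(x, "cheby")

# The two nominally identical low-res versions differ measurably, so a
# model trained on one filter sees a shifted distribution at test time:
mismatch = np.sqrt(np.mean((lo_butter - lo_cheby) ** 2))
print(f"RMS difference between filter choices: {mismatch:.4f}")
```

The nonzero RMS difference is exactly the distribution shift a super-resolution model would have to cope with.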

In any case, I thought it may be interesting to query the ASR community and see what else is happening in machine learning with applications to hi-fi. Anyone know of interesting work or other potential applications?
 

_thelaughingman

Major Contributor
Forum Donor
Joined
Jan 1, 2020
Messages
1,324
Likes
1,943
Interesting application of neural networks and machine learning. It would definitely be great to have an algorithm that can detect anomalies in dynamic range and either resolve them or preserve the dynamics.
 
OP
Tokyo_John

Active Member
Forum Donor
Joined
Mar 6, 2021
Messages
214
Likes
289
I’ve been thinking about more potential applications:
-Analogue simulation: teach a neural net to reproduce the sound characteristic of certain analogue media such as vinyl, tape, etc. I have seen ML film simulations in digital photography that come pretty close to the real look, and this kind of application is relatively straightforward.
-Speaker simulation: want to emulate the sound of a classic model or a specific kind of speaker while using another speaker? ML is a straightforward fit here.
-Room correction/manipulation/equalization: if one can tell the algorithm how the room is supposed to sound and compare that with how it actually sounds, the neural net can make the correction.
-Filtering/smoothing artifacts: another kind of application is dithering to address aliasing, de-noising, filtering out artifacts, etc. As keebz28 suggested, dynamic range variation can also be manipulated.
-File compression: we are not fans of lossy compression, but it may be a way to achieve higher quality for a given bandwidth than existing algorithms allow (e.g., what MQA promised to do, though many are skeptical of that claim).

As I mentioned above, we also have:
-Upscaling/interpolation: training a neural net on high-quality (and high-quantity) data should be a very good way of resampling to higher rates and bit depths while avoiding aliasing and other artifacts.
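For a sense of what the upscaling task involves, here is a minimal sketch (my own toy setup, not from the linked paper) of how training pairs are generated, plus the plain-interpolation baseline that any learned model has to beat; the sample rates and test tones are arbitrary:

```python
# Sketch of the training-pair setup for audio super-resolution:
# downsample a "high quality" signal to make the input, keep the
# original as the label, and see what plain interpolation recovers.
import numpy as np
from scipy import signal, interpolate

fs_hi, fs_lo = 16000, 4000            # a 4x upscaling task
t = np.arange(fs_hi) / fs_hi
hi = np.sin(2 * np.pi * 440 * t) + 0.2 * np.sin(2 * np.pi * 1200 * t)

# Input/label pair: resample_poly applies its own anti-aliasing filter.
lo = signal.resample_poly(hi, fs_lo, fs_hi)

# Baseline "model": cubic spline interpolation back up to fs_hi.
t_lo = np.arange(len(lo)) / fs_lo
spline = interpolate.CubicSpline(t_lo, lo)
pred = spline(t)

# A trained network is only worthwhile if it beats this number:
rmse = np.sqrt(np.mean((pred - hi) ** 2))
print(f"interpolation baseline RMSE: {rmse:.4f}")
```

A neural net earns its keep by predicting the high-frequency detail that the decimation discarded, which no interpolator can recover from the low-res samples alone.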

I can imagine that the neural net will need to be trained differently for different types of music (piano vs. symphony vs. jazz quartet vs. metal, etc.), different recording environments (concert hall, studio, etc.), and other factors. A one-neural-net-fits-all approach seems unlikely to work as well as a model dialed in to the specific characteristics of the recording.

In principle, anything one can train the algorithm to do with target data...can be done. This may not have been developed yet to a high level in hi-fi, but I can easily predict that it is coming...eventually.
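To make the room-correction idea above concrete, here is a toy sketch that uses a plain least-squares FIR fit rather than a neural net; the "room" impulse response, excitation, and filter length are all made up for illustration:

```python
# Toy sketch of room correction: given how the room *does* respond
# (measured) and how it *should* (target), fit a correction filter by
# least squares. A neural net generalizes this to nonlinear,
# signal-dependent corrections; everything below is illustrative.
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)

# Pretend room response: direct sound plus two reflections.
room = np.array([1.0, 0.0, 0.0, 0.6, 0.0, -0.3])

x = rng.standard_normal(4000)                  # excitation signal
measured = np.convolve(x, room)[: len(x)]      # how the room sounds
target = x                                     # how it should sound

# Fit an FIR correction filter h so that (h * measured) ~ target,
# via ordinary least squares on the causal convolution matrix.
L = 32
A = toeplitz(measured, np.r_[measured[0], np.zeros(L - 1)])
h, *_ = np.linalg.lstsq(A, target, rcond=None)

corrected = A @ h
err_before = np.sqrt(np.mean((measured - target) ** 2))
err_after = np.sqrt(np.mean((corrected - target) ** 2))
print(f"RMS error before: {err_before:.3f}, after: {err_after:.3f}")
```

The fitted filter substantially shrinks the error for this linear toy room; the pitch for ML is handling what this cannot, such as position-dependent and nonlinear behaviour.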
 

Wes

Major Contributor
Forum Donor
Joined
Dec 5, 2019
Messages
3,843
Likes
3,788
I just run everything by a sushi chef


[Image: Kyoko head tilt listening while cutting fish]
 

Wes

Major Contributor
Forum Donor
Joined
Dec 5, 2019
Messages
3,843
Likes
3,788
it's a movie still
 

Wes

Major Contributor
Forum Donor
Joined
Dec 5, 2019
Messages
3,843
Likes
3,788
Heh, in all my years in Japan, I never once ran across a sushi chef who looks like that... they are mostly leathery, brusque, sour-faced old men wielding razor-sharp knives.

And BTW, she is half Japanese, a quarter Argentinean, and (without checking her teeth) a quarter English.

The still is from Ex Machina.
 