
Unlimited FIR taps? yes please

TriN

Member
Joined
Jun 26, 2022
Messages
23
Likes
39
Successfully running my own GPU FIR filter. The code looks terrible, but I'm going to clean it up and post it on GitHub. Unfortunately, vacation starts now and I'll be switching over to a Mac, so I'll be able to tune it for Mac users and make it cleaner; not much time until August.

Currently running 12 channels @ 64k taps @ 192 kHz @ 2048 samples = 1.024 million taps
and the load is... 4% on an Nvidia RTX 3080, and 50% on Max 8's CPU side.

I should be able to bring that 50% down to almost nothing once the CPU side of things gets multithreaded (one thread per filter).

Right now this is a Max 8 plugin; porting it to VST3 (or anything else interesting) should be easy.
It is based on OpenCL instead of CUDA, so it should be compatible with Intel, AMD, and Apple GPUs.

Max 8 has a FIR filter (buffir~) currently limited to 4096 taps across the entire app; it is pure CPU and single-threaded.
And if you don't know Max 8, you are missing out!
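For readers unfamiliar with what buffir~ does under the hood: a direct (time-domain) FIR is just a dot product of the most recent N input samples with the tap coefficients, repeated per output sample. A minimal NumPy sketch (illustrative only, not the plugin's code):

```python
import numpy as np

def fir_direct(x, taps):
    """Direct-form FIR: y[n] = sum_k taps[k] * x[n-k].

    O(len(x) * len(taps)) multiply-adds per channel: exactly the
    cost a GPU can parallelize across samples, taps, and channels.
    """
    y = np.zeros(len(x))
    state = np.zeros(len(taps))   # delay line, newest sample first
    for n, sample in enumerate(x):
        state = np.roll(state, 1)  # shift the delay line by one
        state[0] = sample
        y[n] = state @ taps        # dot product with the coefficients
    return y
```

As a cross-check, np.convolve(x, taps)[:len(x)] produces the same output.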
 

ppataki

Major Contributor
Joined
Aug 7, 2019
Messages
1,216
Likes
1,355
Location
Budapest
What's your use case?
I guess it would help me tame these beasts:

 

voodooless

Grand Contributor
Forum Donor
Joined
Jun 16, 2020
Messages
10,223
Likes
17,799
Location
Netherlands
Successfully running my own GPU FIR filter
Very cool :cool:
Currently running 12 channels @ 64k taps @ 192 kHz @ 2048 samples = 1.024 million taps
and the load is... 4% on an Nvidia RTX 3080, and 50% on Max 8's CPU side.
That’s pretty neat! Though I feel like there should be more performance? Compare, for instance, CamillaDSP on a Pi4: it does 8 channels @ 262k taps @ 192 kHz with about 55% CPU. That makes the GPU about 5x faster than the Pi's CPUs. Somehow that doesn't sound like much for a high-end GPU? An AMD Ryzen 7 2700U claims to do 96 channels with the same number of taps with CamillaDSP.

Did you do any profiling on where most of the performance is lost/gained? My guess would be that the bulk of the performance is lost copying data to and from the GPU. If that is the case and you switch to an M1/M2 Mac, you'll profit from unified memory and can skip most of the copying.

Still, being able to free up your CPU for other valuable tasks is extremely useful.
And if you don't know Max 8, you are missing out!
That looks way cool!
 
OP

TriN

Member
Joined
Jun 26, 2022
Messages
23
Likes
39
Very cool :cool:

That’s pretty neat! Though I feel like there should be more performance? Compare, for instance, CamillaDSP on a Pi4: it does 8 channels @ 262k taps @ 192 kHz with about 55% CPU. That makes the GPU about 5x faster than the Pi's CPUs. Somehow that doesn't sound like much for a high-end GPU? An AMD Ryzen 7 2700U claims to do 96 channels with the same number of taps with CamillaDSP.

Did you do any profiling on where most of the performance is lost/gained? My guess would be that the bulk of the performance is lost copying data to and from the GPU. If that is the case and you switch to an M1/M2 Mac, you'll profit from unified memory and can skip most of the copying.

Still, being able to free up your CPU for other valuable tasks is extremely useful.

That looks way cool!
CamillaDSP does FIR very differently: "while FIR use convolution via FFT/IFFT".
Technically, that's not FIR, at least not done the correct way.
FFT is a lot faster: skipping, rounding, cutting... adding, merging... the output doesn't look too good. No wonder the Pi4 is that fast.
(Professional rack-mounted DSPs don't use FFT because they "click"; they have a limited number of taps, less than 8k across the system.)

There isn't much being copied to the GPU; it's signal in -> signal out, and everything needed is stored within the GPU.
For some reason the module itself shares Max 8's resources; I'm not sure why, so I need to dig more into the threading options.

I'm looking for the most elegant solution. CamillaDSP is a lot of work, separated into many projects and dependencies... and it doesn't do FIR the right way.
I like Max 8's interface & signal routing; anyone should be able to drag and drop the module, connect some wires, hit play, done.
Adding a few subs? Adding another FIR for room correction before or after the crossover? A few clicks; it's that easy.
 

voodooless

Grand Contributor
Forum Donor
Joined
Jun 16, 2020
Messages
10,223
Likes
17,799
Location
Netherlands
CamillaDSP does FIR very differently: "while FIR use convolution via FFT/IFFT".
Technically, that's not FIR, at least not done the correct way.
FFT is a lot faster: skipping, rounding, cutting... adding, merging... the output doesn't look too good. No wonder the Pi4 is that fast.
Do you have examples of the output not looking that good? I would be very interested in seeing the difference.
(Professional rack-mounted DSPs don't use FFT because they "click"; they have a limited number of taps, less than 8k across the system.)
Well, just about no DSP hardware contains the needed FFT/IFFT blocks, so no wonder it's not a common practice.
There isn't much being copied to the GPU; it's signal in -> signal out, and everything needed is stored within the GPU.
You need to get the samples in and out, don't you?
I like Max 8's interface & signal routing; anyone should be able to drag and drop the module, connect some wires, hit play, done.
Adding a few subs? Adding another FIR for room correction before or after the crossover? A few clicks; it's that easy.
It really looks very interesting! The description (or rather the lack of one) on the website doesn't seem to do it any justice, though (edit: looks like the desktop version is much better in regards to information :) )... And don't forget: CamillaDSP is totally free, and Max 8 is not. Still, I'll have a closer look.

I especially like that you used OpenCL. That will make porting to the various platforms very easy.
 
Last edited:
OP

TriN

Member
Joined
Jun 26, 2022
Messages
23
Likes
39
Do you have examples of the output not looking that good? I would be very interested in seeing the difference.
When "not enough" taps are set it gets ugly: it doesn't generate both ends of the "sample" correctly, so when the next one merges over you can hear cracks; having more taps solves this issue.
I might try NUFFT above 200k taps; running the GPU at 100% (which it actually is, not the 4% reported by Windows :/ ) isn't sustainable.
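For context on those block-edge cracks: in standard overlap-add convolution they disappear once the FFT length is at least block + taps − 1, because each block's tail is explicitly added into the next block's output. A minimal NumPy sketch (illustrative, not CamillaDSP's or the plugin's code):

```python
import numpy as np

def overlap_add(x, h, block=2048):
    """FFT convolution of signal x with kernel h, processed in blocks.

    The FFT size is padded to at least block + len(h) - 1 so no
    circular wrap-around occurs; each block's tail overlaps into the
    next block's output, which is what avoids edge clicks.
    """
    n_out = len(x) + len(h) - 1
    nfft = 1 << (block + len(h) - 2).bit_length()  # >= block + len(h) - 1
    H = np.fft.rfft(h, nfft)                       # kernel spectrum, computed once
    y = np.zeros(n_out)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        tail = len(seg) + len(h) - 1
        y_seg = np.fft.irfft(np.fft.rfft(seg, nfft) * H, nfft)
        y[start:start + tail] += y_seg[:tail]      # overlap-add the block's tail
    return y
```

The result matches a plain np.convolve(x, h) to floating-point precision for any block size.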
 

voodooless

Grand Contributor
Forum Donor
Joined
Jun 16, 2020
Messages
10,223
Likes
17,799
Location
Netherlands
When "not enough" taps are set it gets ugly: it doesn't generate both ends of the "sample" correctly, so when the next one merges over you can hear cracks; having more taps solves this issue.
I guess if there were real issues with this in CamillaDSP, there would be reports of it? Seems like a simple problem to solve, at least.
 

Drengur

Active Member
Forum Donor
Joined
Aug 17, 2021
Messages
150
Likes
412
Subscribed. I had been wondering the exact same thing for a while; why not use a GPU? Is it necessary? Probably not. Is it cool? Very.
 

DonH56

Master Contributor
Technical Expert
Forum Donor
Joined
Mar 15, 2016
Messages
7,834
Likes
16,496
Location
Monument, CO
GPUs are an obvious choice, since they are used for so many other things in parallel-processing land. Still, I wonder about latency... Not an issue for music listening, but maybe an issue for gamers? For video, how many taps can you use before you can no longer keep audio and video in sync?
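There is a simple back-of-the-envelope answer to the latency question: a linear-phase FIR delays the signal by half its tap count divided by the sample rate, plus at least one block of buffering. Using the figures from this thread (64k taps, 192 kHz, 2048-sample blocks; linear phase assumed):

```python
taps = 65536    # 64k taps
fs = 192_000    # sample rate, Hz
block = 2048    # buffer size, samples

group_delay_ms = (taps / 2) / fs * 1000  # linear-phase FIR group delay
buffer_ms = block / fs * 1000            # one block of I/O buffering

print(round(group_delay_ms, 1), round(buffer_ms, 1))  # 170.7 and 10.7 ms
```

Roughly 180 ms total is far beyond typical lip-sync tolerances, so for video you would want either far fewer taps or a minimum-phase design.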
 

nc535

Member
Joined
Dec 30, 2021
Messages
52
Likes
62
Pardon the possibly naive question, but why use FFT and iFFT for convolution when time-domain convolution itself is so simple?
 

gnarly

Addicted to Fun and Learning
Joined
Jun 15, 2021
Messages
989
Likes
1,390
Successfully running my own GPU FIR filter. The code looks terrible, but I'm going to clean it up and post it on GitHub. Unfortunately, vacation starts now and I'll be switching over to a Mac, so I'll be able to tune it for Mac users and make it cleaner; not much time until August.

Currently running 12 channels @ 64k taps @ 192 kHz @ 2048 samples = 1.024 million taps
Neat project !

Would you explain the math behind "= 1.024 million taps"?

I would think the total tap count needed would be the tap count per channel x the number of channels, or 64k taps x 12 channels = 768,000 total taps.
I've read elsewhere that the ballpark MIPS required per channel is taps x sampling rate, which for this example would be 64k x 192 kHz, or 12,288 MIPS.
And I don't know how 2048 samples fits into your equation...

I explained my understanding because I know it is rudimentary with regard to computational issues...
I thought that might help you answer what I'm missing, thanks!
 

gnarly

Addicted to Fun and Learning
Joined
Jun 15, 2021
Messages
989
Likes
1,390
Pardon the possibly naive question, but why use FFT and iFFT for convolution when time-domain convolution itself is so simple?
Aaaah, I think that answers my wondering about how 2048 samples fit in...

Good question !!
 

voodooless

Grand Contributor
Forum Donor
Joined
Jun 16, 2020
Messages
10,223
Likes
17,799
Location
Netherlands
Pardon the possibly naive question, but why use FFT and iFFT for convolution when time-domain convolution itself is so simple?
Because it's generally faster; well, at least in many cases it's faster. That is why the Pi can do so many taps with its humble hardware.

One could easily do this on a GPU as well: https://clmathlibraries.github.io/clFFT/
 

pkane

Master Contributor
Forum Donor
Joined
Aug 18, 2017
Messages
5,630
Likes
10,203
Location
North-East
Pardon the possibly naive question, but why use FFT and iFFT for convolution when time-domain convolution itself is so simple?

FFT is much faster for larger filter kernels. Unoptimized time-domain convolution takes N*N operations (N being the size of the kernel); unoptimized FFT-based convolution takes on the order of N*log2(N) operations.

For filter size 2048, time-domain convolution takes 2048*2048 = 4,194,304 operations, while the FFT approach takes only about 2048*log2(2048) = 22,528. The difference becomes huge as filter tap counts increase into the millions.
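The gap is easy to tabulate. Counting schematically (one "op" per multiply-add, constant factors ignored; a real FFT convolution also needs a forward transform, a spectral multiply, and an inverse transform, but the N·log2(N) term dominates):

```python
import math

def direct_ops(n):
    """Time-domain convolution of an n-sample block with an n-tap kernel."""
    return n * n

def fft_ops(n):
    """Schematic FFT-convolution cost, n * log2(n)."""
    return int(n * math.log2(n))

for n in (2048, 65536, 1 << 20):
    print(n, direct_ops(n), fft_ops(n), direct_ops(n) // fft_ops(n))
```

At 2048 taps the ratio is already about 186x; at a million taps it is roughly 52,000x.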
 

dc655321

Major Contributor
Joined
Mar 4, 2018
Messages
1,597
Likes
2,235
CamillaDSP does FIR very differently: "while FIR use convolution via FFT/IFFT".
Technically, that's not FIR, at least not done the correct way.
FFT is a lot faster: skipping, rounding, cutting... adding, merging... the output doesn't look too good. No wonder the Pi4 is that fast.
(Professional rack-mounted DSPs don't use FFT because they "click"; they have a limited number of taps, less than 8k across the system.)

It’s difficult to understand what you’re saying here. One does not “do” FIR. There is nothing incorrect or “very different” in how CamillaDSP operates.

This smells a bit like a misunderstanding…
 

Martin

Major Contributor
Forum Donor
Joined
Mar 23, 2018
Messages
1,895
Likes
5,536
Location
Cape Coral, FL
Successfully running my own GPU FIR filter. The code looks terrible, but I'm going to clean it up and post it on GitHub. Unfortunately, vacation starts now and I'll be switching over to a Mac, so I'll be able to tune it for Mac users and make it cleaner; not much time until August.

Currently running 12 channels @ 64k taps @ 192 kHz @ 2048 samples = 1.024 million taps
and the load is... 4% on an Nvidia RTX 3080, and 50% on Max 8's CPU side.

I should be able to bring that 50% down to almost nothing once the CPU side of things gets multithreaded (one thread per filter).

Right now this is a Max 8 plugin; porting it to VST3 (or anything else interesting) should be easy.
It is based on OpenCL instead of CUDA, so it should be compatible with Intel, AMD, and Apple GPUs.

Max 8 has a FIR filter (buffir~) currently limited to 4096 taps across the entire app; it is pure CPU and single-threaded.
And if you don't know Max 8, you are missing out!

I don't know what any of this means. Where can I find more information about it and its use?

Martin
 

KSTR

Major Contributor
Joined
Sep 6, 2018
Messages
2,690
Likes
6,013
Location
Berlin, Germany
Technically, that's not FIR, at least not done the correct way.
IME, when done in high precision (>= 64-bit floats), FFT/iFFT-based convolution is totally equivalent to doing it directly in the time domain (again with >= 64-bit floats). Numeric artifacts are low enough to be practically irrelevant... they are lowest with direct time-domain convolution, however, whose computation time is the highest for everything but the smallest kernels.

With 32-bit float FFT/iFFT performed on 24-bit inputs, I'm often seeing "echo"-type artifacts. Not necessarily audible when done for final speaker outputs, though.
Fixed-point implementations might be another can of worms, but I admit I have little experience with those.
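The precision point can be demonstrated numerically: run the same FFT convolution with the spectra quantized to single precision versus kept in doubles, and compare both against a double-precision direct convolution. A sketch (illustrative; NumPy computes its FFTs in double internally, so casting the spectra to complex64 only simulates a single-precision pipeline):

```python
import numpy as np

def fft_convolve(x, h, single=False):
    """Full linear convolution via zero-padded real FFTs."""
    n = len(x) + len(h) - 1
    nfft = 1 << (n - 1).bit_length()  # power-of-two FFT size >= n
    X = np.fft.rfft(x, nfft)
    H = np.fft.rfft(h, nfft)
    if single:  # quantize the spectra to 32-bit precision
        X = X.astype(np.complex64)
        H = H.astype(np.complex64)
    return np.fft.irfft(X * H, nfft)[:n]

rng = np.random.default_rng(42)
x = rng.standard_normal(4096)
h = rng.standard_normal(512)
ref = np.convolve(x, h)  # float64 direct convolution as the reference
err64 = np.max(np.abs(fft_convolve(x, h) - ref))
err32 = np.max(np.abs(fft_convolve(x, h, single=True) - ref))
# err64 sits near double-precision round-off; err32 is orders of
# magnitude larger, though still small relative to the signal.
```

Whether the single-precision residue is audible depends on the signal headroom and where in the chain the convolution sits, as noted above.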
 