
Unlimited FIR taps? yes please

sarumbear

Master Contributor
Forum Donor
Joined
Aug 15, 2020
Messages
7,604
Likes
7,324
Location
UK
Successfully running my own GPU FIR filter. The code looks terrible, but I'm going to clean it up and post it on GitHub. Unfortunately vacation starts now and I'll be switching over to a Mac, so I'll be able to tune it for Mac users and make it cleaner. Not much time until August.

Currently running 12 channels @ 64k taps @ 192khz @ 2048 samples = 1.024 Million Taps
and the load is... 4% on an Nvidia RTX 3080, with Max 8 at 50% CPU load

I should be able to bring that 50% down to almost nothing once the CPU side of things gets multithreaded (one thread per filter).

Right now this is a Max 8 plugin; porting it to VST3 should be easy, or to anything else interesting.
It is based on OpenCL instead of CUDA, so it should be compatible with Intel, AMD and Apple GPUs.

Max 8 has a FIR filter (buffir~) currently limited to 4096 taps across the entire app; it is pure CPU and single-threaded.
And if you don't know Max 8, you are missing out!
I wonder if you know of any complex audio processor, like an AVP, that does audio processing on a GPU?

On a similar line, if the available processing power is, like you say, almost limitless, do software-based AVRs like Trinnov intentionally cripple the models they differentiate by the number of channels they can process?
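As a sketch of the "one thread per filter" plan mentioned above, here is a minimal CPU-side version in Python. The helper names are hypothetical and this is not the actual Max 8 / OpenCL plugin code; NumPy releases the GIL inside `convolve`, so the threads can genuinely overlap:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def run_filter(args):
    x, h = args
    # each worker convolves one channel with its own FIR filter
    return np.convolve(x, h, mode="full")

def process_block(channels, filters):
    # one CPU thread per filter, as proposed above
    with ThreadPoolExecutor(max_workers=len(channels)) as pool:
        return list(pool.map(run_filter, zip(channels, filters)))
```

In a real plugin the workers would persist across blocks and carry overlap state, but the parallel structure (one independent convolution per channel) is the same.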
 

dc655321

Major Contributor
Joined
Mar 4, 2018
Messages
1,597
Likes
2,235
With 32bit float FFT/iFFT performed on 24bit inputs, I'm often seeing "echo" type artifacts.

Was the output undithered?
I would be interested in seeing how such artifacts may be produced, if you can share.

Not too surprised to see effects at those bit depths: 32-bit float precision is essentially 23 bits of mantissa (24 with the implicit bit).
OTOH, I would be surprised if any such effects were > -120dB.
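To illustrate how a 32-bit float FFT/iFFT pipeline leaves a low-level error floor, here is a small NumPy sketch. Note it only approximates single precision by rounding the spectra to complex64 (NumPy's FFT itself always computes in float64), and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)   # broadband test signal
h = rng.standard_normal(1024)
h /= np.abs(h).sum()            # keep the filter passive

n = len(x) + len(h) - 1
nfft = 1 << (n - 1).bit_length()

# double-precision reference
Y64 = np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft)
y64 = np.fft.irfft(Y64, nfft)[:n]

# simulate a 32-bit float pipeline: round both spectra to complex64
# before the multiply, then convolve
X32 = np.fft.rfft(x, nfft).astype(np.complex64)
H32 = np.fft.rfft(h, nfft).astype(np.complex64)
y32 = np.fft.irfft((X32 * H32).astype(np.complex128), nfft)[:n]

# peak error of the single-precision path relative to peak signal
err_db = 20 * np.log10(np.max(np.abs(y32 - y64)) / np.max(np.abs(y64)))
```

With signals like these, the residual lands well below -100 dB relative to peak, consistent with artifacts that only matter once subtractive/averaging methods dig into the noise floor.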
 

KSTR

Major Contributor
Joined
Sep 6, 2018
Messages
2,776
Likes
6,212
Location
Berlin, Germany
In a measurements project I had to convolve the measured (arbitrary) signal with the IR obtained from a second, similar measurement using a logsweep and convolution, and vice versa.
I originally used Audition's convolver, which is 32-bit (and dithered to 24 bit on export). I got almost the same result with ConvolverVST, which uses 32-bit float FFT/iFFT as well.
As I had no access to 64-bit FFT/iFFT convolvers at the time, I wrote a time-domain convolver in 64-bit float to get rid of the artifacts, successfully... They were IIRC in the < -120dB range and spoiled the test results a bit at first (because averaging/subtractive methods were used, digging deep into the noise floor).
 

voodooless

Grand Contributor
Forum Donor
Joined
Jun 16, 2020
Messages
10,403
Likes
18,363
Location
Netherlands
On a similar line, if the available processing power is, like you say, almost limitless, do software-based AVRs like Trinnov intentionally cripple the models they differentiate by the number of channels they can process?
Possibly. If you look at the innards it’s a mix of a regular PC and a lot of DSP, FPGA and other ASICs. Amazingly, the interface between the PC and the DAC part seems to be done using an ancient technology called FireWire :eek:. But I bet most of the DSP'ing is done in software. An i7 should be able to do plenty of channels, though they could very well use a bigger CPU on the model with more channels.
 

sarumbear

Master Contributor
Forum Donor
Joined
Aug 15, 2020
Messages
7,604
Likes
7,324
Location
UK
Possibly. If you look at the innards it’s a mix of a regular PC and a lot of DSP, FPGA and other ASICs. Amazingly, the interface between the PC and the DAC part seems to be done using an ancient technology called FireWire :eek:. But I bet most of the DSP'ing is done in software. An i7 should be able to do plenty of channels, though they could very well use a bigger CPU on the model with more channels.
The reason I asked is because of the OP's post. I have not heard of anyone doing audio on a GPU, and according to the OP there is almost limitless potential compared to a CPU.
 

voodooless

Grand Contributor
Forum Donor
Joined
Jun 16, 2020
Messages
10,403
Likes
18,363
Location
Netherlands
The reason I asked is because of the OP's post. I have not heard of anyone doing audio on a GPU, and according to the OP there is almost limitless potential compared to a CPU.
Granted, there isn’t much out there. Coincidentally there was a topic just yesterday about a new company using GPU DSP processing as well:


But nowadays, with things like OpenCL, it largely doesn’t matter anymore where you run your code. You write it once, and it will run on just about any CPU or GPU with relative efficiency.

I think one of the main reasons for not using a GPU is that you really don’t need the limitless potential in the vast majority of cases.
 

gnarly

Major Contributor
Joined
Jun 15, 2021
Messages
1,034
Likes
1,469
I think one of the main reasons for not using a GPU is that you really don’t need the limitless potential in the vast majority of cases.
That's what i'm wondering...
What is the use of more taps beyond achieving sufficient low frequency resolution?
1 Hz resolution only needs 2x the sampling rate for linear phase use. And of course even fewer taps if the FIR filter is being used for minimum phase.
Do we need more frequency resolution than 1 Hz?

Maybe it can be argued we do, if trying to enact a sharp low-pass for a sub, or high-Q EQs in the sub's passband.....
But even then, I think 64k taps per channel at 48kHz, with ~1.5Hz resolution, is all I'd ever ask for.
Because in my experience, trying to EQ too fine only causes harm.

Oh, I've been saying taps when I really meant to be saying "FIR time"......taps before the impulse peak, divided by sampling rate.....or the delay of the FIR filter in msec/sec.
"FIR time" example: 64k taps @ 48kHz sampling = 256k taps @ 192kHz sampling....for an equal 1.5Hz frequency resolution

(I see no reason to use a higher sampling rate than 96kHz...and honestly, 48kHz is a better choice with most of today's hardware devices being tap-constrained)
Just my 2c
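The tap/resolution arithmetic above can be written out as a tiny helper (hypothetical name, assuming resolution ≈ sampling rate divided by the pre-peak tap count):

```python
def fir_resolution_hz(taps, fs, linear_phase=True):
    """Approximate frequency resolution of an FIR filter.

    For a linear-phase filter the impulse peak sits at the centre,
    so only taps/2 of pre-peak "FIR time" is available; a
    minimum-phase filter can use (roughly) the full length.
    """
    effective_taps = taps / 2 if linear_phase else taps
    return fs / effective_taps
```

`fir_resolution_hz(65536, 48000)` and `fir_resolution_hz(262144, 192000)` both come out to about 1.46 Hz, matching the "64k taps @ 48kHz = 256k taps @ 192kHz" equivalence and the ~1.5 Hz figure above.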
 

tiramisu

Member
Joined
Jun 24, 2022
Messages
98
Likes
101
Is DSP a parallelizable problem? I had thought FPGAs and ASICs were the tools of choice for applying algorithms to sound.
 

DonH56

Master Contributor
Technical Expert
Forum Donor
Joined
Mar 15, 2016
Messages
7,901
Likes
16,716
Location
Monument, CO
The high cost (and power) of GPUs, especially now, has (and will) likely keep them out of mainstream AVR/AVP units. Dedicated DSPs or FPGAs seem popular. Trinnov uses standard Intel processors, no specialized DSP AFAIK. Maybe there's a market for a high-end GPU-based sound processor? If you can sell boxes of dirt for better grounding...
 

dc655321

Major Contributor
Joined
Mar 4, 2018
Messages
1,597
Likes
2,235
What is the use of more taps beyond achieving sufficient low frequency resolution?

You need to have a heart-to-heart with Rob Watts and the Chord marketing department ;)

Is DSP a parallelizable problem? I had thought FPGAs and ASICs were the tools of choice for applying algorithms to sound.

Some aspects of DSP can benefit from parallelization, sure.
Partitioned convolution algorithms, SIMO/MIMO systems (e.g., line arrays), even fundamentals like FFT/IFFT calcs would benefit.

But for the typical audiophile needs, heavy parallelization is not a requirement.
 

sarumbear

Master Contributor
Forum Donor
Joined
Aug 15, 2020
Messages
7,604
Likes
7,324
Location
UK
The high cost (and power) of GPUs, especially now, has (and will) likely keep them out of mainstream AVR/AVP units.
Apple’s recent A series SoCs have pretty efficient GPUs that pack a punch. The M2 is even more powerful without much more power penalty.

Maybe someone can build an app that runs on an iPad Pro? USB-C should be enough to act as the I/O for even a large AVP. All you need to add is a multi-channel DAC. Or use a digital amp.

Your UI is sorted too…

AVP: There’s an app for that! :)
 
OP
T

TriN

Member
Joined
Jun 26, 2022
Messages
23
Likes
39
Experiencing signal degradation above 24k taps/channel; it could be due to the nature of pure FIR while working from "only" 2048 samples. Is the time domain too short?

Difficult to find answers, since no one really does pure FIR that high. DSPs are limited to 8192 taps (not FFT), which makes me wonder if this is because of performance issues or signal degradation above that point; likely both.

Max 8 is at capacity with 4096 "across all channels", so I don't think they ever considered FFT.
The amps running the JBL M2s are limited to 3500 taps (the source of the filter isn't available; I bet there is some IIR going on plus linear correction with FIR).

There is a FIR convolver from `gpu.audio` (it doesn't help us; that's for adding effects to tracks, and the source of each effect is a .wav a few seconds long).
They have a hybrid approach, "partitioned convolution + synthetic", a mix of pure FIR & FFT. Very interesting.
They also don't use OpenCL ("it's dead, no support from AMD"); they use CUDA and "HIP" to convert the code to something AMD can execute.

Regarding latency, it's not that bad: 40 to 50 ms (I was looking at the latest Epson projectors; they have video delays built in).

Do I need more taps? I mean, it sounds good at 4096... do I need more? Well, the GPU gives me a large number of channels to play with, so I don't have to break 4096 into pieces.
My goal is an active crossover with 2 filters per channel, 6 channels total, without the need for Linux, an RPi4, or maxing out the CPU... computers nowadays have GPUs, integrated or not, so let's try to use that power.

Anyone have experience with pure FIR and a large number of taps? I'm starting to believe that a greater number gives me much lower signal quality, so if FFT is the solution, I'll give it a try.
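If FFT is the route, the standard technique is partitioned (overlap-add) convolution, which is mathematically equivalent to direct FIR up to rounding while costing far less per sample. A minimal single-partition sketch in NumPy (illustrative names, not the gpu.audio implementation):

```python
import numpy as np

def overlap_add(x, h, block=2048):
    """Overlap-add FFT convolution of signal x with FIR h.

    Each input block is zero-padded, multiplied by the filter
    spectrum, and the tails are summed into the output, giving the
    same result as direct convolution up to rounding error.
    """
    # FFT size must hold one block plus the filter tail without wrap
    nfft = 1 << int(np.ceil(np.log2(block + len(h) - 1)))
    H = np.fft.rfft(h, nfft)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        conv = np.fft.irfft(np.fft.rfft(seg, nfft) * H, nfft)
        valid = len(seg) + len(h) - 1
        y[start:start + valid] += conv[:valid]
    return y
```

Real-time convolvers additionally partition the *filter* into blocks so latency stays at one block while the tap count grows, but the core idea is the same.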
 

DonH56

Master Contributor
Technical Expert
Forum Donor
Joined
Mar 15, 2016
Messages
7,901
Likes
16,716
Location
Monument, CO
Regarding latency, it's not that bad: 40 to 50 ms (I was looking at the latest Epson projectors; they have video delays built in).

Video delays? Nice, have not seen that, at least that I recall. Usually audio is delayed on the assumption that video processing takes longer. Maybe that is no longer true with the latest multichannel lossless codecs? I think the AVRs I have in-house have max audio delays around 50~100 ms (not sure, have not looked in a while, and that info tends to be hard to find) and no video delays.
 

dc655321

Major Contributor
Joined
Mar 4, 2018
Messages
1,597
Likes
2,235
Anyone having experience with pure FIR and large number of taps? I'm starting to believe that a greater number gives me a much lower signal quality so if FFT is the solution, I'll give it a try.

You keep using the phrase, “pure FIR”.
AFAIK, that’s not a thing.

So, what do you mean? Time-domain convolution (aka linear convolution)?
 

gnarly

Major Contributor
Joined
Jun 15, 2021
Messages
1,034
Likes
1,469
Anyone having experience with pure FIR and large number of taps? I'm starting to believe that a greater number gives me a much lower signal quality so if FFT is the solution, I'll give it a try.
Hi, let me echo dc655321's question... what do you mean by 'pure FIR'?
And what do you mean by lower signal quality? How determined?

I've run a number of FIR setups from 4k taps per channel, to 65k taps per channel. Mostly in linear phase mode.
My experience has been that sound quality improves with more taps until a sufficient level of frequency resolution occurs, and then more taps don't add anything.
I think proper implementation of FIR filters, particularly if they are linear phase, matters more to sound quality than the number of taps, except at low frequencies.

I'm currently using a Q-Sys Core 500i hardware processor configured to run 15 channels of 16k taps per channel @ 48kHz. (It's for a 5-way speaker in an LCR setup.)
Bought it used for $1200..... hardware does exist...
A 6-year-old RasPi worked fine with 8 channels of 65k taps @ 48kHz... I can't imagine how many taps a current PC CPU can run...

So I'm really having a hard time understanding why a GPU would be needed, either computationally or, as per my prior post, pragmatically in terms of how much FIR time is really needed.
Sorry, I'm not getting where you're coming from.....
 

AwesomeSauce2015

Active Member
Joined
Apr 14, 2022
Messages
205
Likes
195
Subscribed.
Not an expert on FIR filters, but I struggle to see a situation in which the hundreds of cores of a (small) modern GPU would be properly utilized.
I mean, most modern computer CPUs have at least 4 cores, with Intel's new i9-12900 having many more.
An RTX 3080 has many thousands of compute cores.

While I definitely think this is a cool idea which may work for numerous filters and/or fully correcting a large audio system, I just don't see the need per se for a GPU-based DSP engine over dedicated hardware, which would probably be preferred in a large and complex setup.
The only real benefit of a GPU over a CPU, in a dedicated box, is the higher-speed memory access (with high-end GPUs), and even then iGPUs don't get that. If you are trying to run this on top of Windows or some other OS, you'll probably have a lot of problems with latency due to process scheduling... not to mention the problem of getting the signal into the system.
 
OP
T

TriN

Member
Joined
Jun 26, 2022
Messages
23
Likes
39
Pure FIR, brute force... is a loop over the samples, then another loop inside for the taps, plus some shifting... that's the part that takes a lot of time. It's the most basic and accurate way.
2048 samples * x taps doesn't seem like a lot to compute... but at 192kHz those samples are coming in hot!

Then you have many other ways: FFT, cuFFT, FFTW, FFTS, FFTE, FFTW3 and dozens more, all based on the same idea but with modifications and workarounds to go faster. They all have a different output / error rate; some are single-precision float vs double. It doesn't have to be perfect; they are mostly used for 2D and 3D.

The GPU way explodes that loop(loop()) into 2 dimensions across ~8000 threads. Better than a couple, right?
It also gives me "atomic" functions to save the results of these thousands of threads into the output stream; accessing x while doing (x = x + a) across many threads doesn't add up, and atomic functions are there to help, giving an extra dimension, aka more speed.

By lower quality, I mean it feels like the signal is duplicating itself, stretching / rubber-banding; at 100k+ taps it becomes a pain to listen to.
Since the "pure FIR" high-end DSPs available on the market have more or less 8k taps, no one has really run a lot more taps to say... nah, that's stupid, don't do it.
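For reference, the loop-in-a-loop described above can be sketched in Python (a hypothetical `brute_force_fir`, not the OP's actual Max 8 / OpenCL code):

```python
import numpy as np

def brute_force_fir(x, h):
    """The 'pure FIR' double loop: outer over output samples,
    inner over taps. Exact but O(len(x) * len(h))."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(h)):
            if n - k >= 0:           # shift k samples back in time
                acc += h[k] * x[n - k]
        y[n] = acc
    return y
```

On a GPU, each (n, k) pair can be handed to its own thread, with the per-sample accumulation done via a reduction or the atomic adds mentioned above; that is the "explosion into 2 dimensions" in practice.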
 

dc655321

Major Contributor
Joined
Mar 4, 2018
Messages
1,597
Likes
2,235
Then you have many other ways: FFT, cuFFT, FFTW, FFTS, FFTE, FFTW3 and dozens more...

How about WTF? You forgot that one.
 

thorvat

Senior Member
Joined
Aug 9, 2021
Messages
323
Likes
387
My experience has been that sound quality improves with more taps until a sufficient level of frequency resolution occurs, and then more taps don't add anything.

Actually they do... They add more delay. :D
 