
Unlimited FIR taps? yes please

sarumbear

Master Contributor
Forum Donor
Joined
Aug 15, 2020
Messages
7,604
Likes
7,324
Location
UK
Successfully running my own GPU FIR filter. The code looks terrible, but I'm going to clean it up and post it on GitHub. Unfortunately vacation starts now and I'll be switching over to a Mac, so I'll be able to tune it for Mac users and make it cleaner. Not much time until August.

Currently running 12 channels @ 64k taps @ 192khz @ 2048 samples = 1.024 Million Taps
and the load is... 4% on an Nvidia RTX 3080, with Max 8 at 50% CPU load

I should be able to bring that 50% down to almost nothing once the CPU side of things gets multithreaded (one thread per filter).

Right now this is a Max 8 plugin; porting it to VST3 should be easy, or to anything else interesting.
It is based on OpenCL instead of CUDA, so it should be compatible with Intel, AMD and Apple GPUs.

Max 8 has a FIR filter (buffir~) currently limited to 4096 taps across the entire app; it is pure CPU and single-threaded.
And if you don't know Max 8, you are missing out!
I wonder if you know of any complex audio processor, like an AVP, that does audio processing on a GPU?

On a similar line, if the available processing power is, like you say, almost limitless, do software-based AVRs like Trinnov intentionally cripple the models they differentiate by the number of channels they can process?
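As a sketch of the "one thread per filter" plan mentioned above, here is a minimal CPU-side version in Python. The helper names are hypothetical and this is not the actual Max 8 / OpenCL plugin code; NumPy releases the GIL inside `convolve`, so the threads can genuinely overlap:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def run_filter(args):
    x, h = args
    # each worker convolves one channel with its own FIR filter
    return np.convolve(x, h, mode="full")

def process_block(channels, filters):
    # one CPU thread per filter, as proposed above
    with ThreadPoolExecutor(max_workers=len(channels)) as pool:
        return list(pool.map(run_filter, zip(channels, filters)))
```

In a real plugin the workers would persist across blocks and carry overlap state, but the parallel structure (one independent convolution per channel) is the same.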
 

dc655321

Major Contributor
Joined
Mar 4, 2018
Messages
1,597
Likes
2,235
With 32bit float FFT/iFFT performed on 24bit inputs, I'm often seeing "echo" type artifacts.

Was the output undithered?
I would be interested in seeing how such artifacts may be produced, if you can share.

Not too surprised to see effects at those bit depths: 32-bit float precision is essentially 23 bits of mantissa (24 with the implicit bit).
OTOH, I would be surprised if any such effects were > -120dB.
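To illustrate how a 32-bit float FFT/iFFT pipeline leaves a low-level error floor, here is a small NumPy sketch. Note it only approximates single precision by rounding the spectra to complex64 (NumPy's FFT itself always computes in float64), and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)   # broadband test signal
h = rng.standard_normal(1024)
h /= np.abs(h).sum()            # keep the filter passive

n = len(x) + len(h) - 1
nfft = 1 << (n - 1).bit_length()

# double-precision reference
Y64 = np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft)
y64 = np.fft.irfft(Y64, nfft)[:n]

# simulate a 32-bit float pipeline: round both spectra to complex64
# before the multiply, then convolve
X32 = np.fft.rfft(x, nfft).astype(np.complex64)
H32 = np.fft.rfft(h, nfft).astype(np.complex64)
y32 = np.fft.irfft((X32 * H32).astype(np.complex128), nfft)[:n]

# peak error of the single-precision path relative to peak signal
err_db = 20 * np.log10(np.max(np.abs(y32 - y64)) / np.max(np.abs(y64)))
```

With signals like these, the residual lands well below -100 dB relative to peak, consistent with artifacts that only matter once subtractive/averaging methods dig into the noise floor.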
 

KSTR

Major Contributor
Joined
Sep 6, 2018
Messages
2,776
Likes
6,212
Location
Berlin, Germany
In a measurements project I had to convolve the measured (arbitrary) signal with the IR obtained from a second, similar measurement using a logsweep and convolution, and vice versa.
I originally used Audition's convolver, which is 32-bit (and dithered to 24 bit on export). I got almost the same result with ConvolverVST, which uses 32-bit float FFT/iFFT as well.
As I had no access to 64-bit FFT/iFFT convolvers at the time, I wrote a time-domain convolver in 64-bit float to get rid of the artifacts, successfully... They were IIRC in the < -120dB range and spoiled the test results a bit at first (because averaging/subtractive methods were used, digging deep into the noise floor).
 

voodooless

Grand Contributor
Forum Donor
Joined
Jun 16, 2020
Messages
10,403
Likes
18,363
Location
Netherlands
On a similar line, if the available processing power is, like you say, almost limitless, do software-based AVRs like Trinnov intentionally cripple the models they differentiate by the number of channels they can process?
Possibly. If you look at the innards it’s a mix of a regular PC and a lot of DSP, FPGA and other ASICs. Amazingly, the interface between the PC and the DAC part seems to be done using an ancient technology called FireWire :eek:. But I bet most of the DSP'ing is done in software. An i7 should be able to do plenty of channels, though they could very well use a bigger CPU on the model with more channels.
 

sarumbear

Master Contributor
Forum Donor
Joined
Aug 15, 2020
Messages
7,604
Likes
7,324
Location
UK
Possibly. If you look at the innards it’s a mix of a regular PC and a lot of DSP, FPGA and other ASICs. Amazingly, the interface between the PC and the DAC part seems to be done using an ancient technology called FireWire :eek:. But I bet most of the DSP'ing is done in software. An i7 should be able to do plenty of channels, though they could very well use a bigger CPU on the model with more channels.
The reason I asked is because of the OP's post. I have not heard of anyone doing audio on a GPU, and according to the OP there is almost limitless potential compared to a CPU.
 

voodooless

Grand Contributor
Forum Donor
Joined
Jun 16, 2020
Messages
10,403
Likes
18,363
Location
Netherlands
The reason I asked is because of the OP's post. I have not heard of anyone doing audio on a GPU, and according to the OP there is almost limitless potential compared to a CPU.
Granted, there isn’t much out there. Coincidentally there was a topic just yesterday about a new company using GPU DSP processing as well:


But nowadays, with things like OpenCL, it largely doesn’t matter anymore where you run your code. You write it once, and it will run on just about any CPU or GPU with relative efficiency.

I think one of the main reasons for not using a GPU is that you really don’t need the limitless potential in the vast majority of cases.
 

gnarly

Major Contributor
Joined
Jun 15, 2021
Messages
1,034
Likes
1,469
I think one of the main reasons for not using a GPU is that you really don’t need the limitless potential in the vast majority of cases.
That's what i'm wondering...
What is the use of more taps beyond achieving sufficient low frequency resolution?
1 Hz resolution only needs 2x the sampling rate for linear phase use. And of course even fewer taps if the FIR filter is being used for minimum phase.
Do we need more frequency resolution than 1 Hz?

Maybe it can be argued we do, if trying to enact a sharp low-pass for a sub, or high-Q EQs in the sub's passband.....
But even then, I think 64k taps per channel at 48kHz, with ~1.5Hz resolution, is all I'd ever ask for.
Because in my experience, trying to EQ too fine only causes harm.

Oh, I've been saying taps when I really meant to be saying "FIR time"......taps before the impulse peak, divided by sampling rate.....or the delay of the FIR filter in msec/sec.
"FIR time" example: 64k taps @ 48kHz sampling = 256k taps @ 192kHz sampling....for an equal 1.5Hz frequency resolution

(I see no reason to use a higher sampling rate than 96kHz...and honestly, 48kHz is a better choice with most of today's hardware devices being tap-constrained)
Just my 2c
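The tap/resolution arithmetic above can be written out as a tiny helper (hypothetical name, assuming resolution ≈ sampling rate divided by the pre-peak tap count):

```python
def fir_resolution_hz(taps, fs, linear_phase=True):
    """Approximate frequency resolution of an FIR filter.

    For a linear-phase filter the impulse peak sits at the centre,
    so only taps/2 of pre-peak "FIR time" is available; a
    minimum-phase filter can use (roughly) the full length.
    """
    effective_taps = taps / 2 if linear_phase else taps
    return fs / effective_taps
```

`fir_resolution_hz(65536, 48000)` and `fir_resolution_hz(262144, 192000)` both come out to about 1.46 Hz, matching the "64k taps @ 48kHz = 256k taps @ 192kHz" equivalence and the ~1.5 Hz figure above.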
 

tiramisu

Member
Joined
Jun 24, 2022
Messages
98
Likes
101
Is DSP a parallelizable problem? I had thought FPGAs and ASICs were the tools of choice for applying algorithms to sound.
 

DonH56

Master Contributor
Technical Expert
Forum Donor
Joined
Mar 15, 2016
Messages
7,901
Likes
16,716
Location
Monument, CO
The high cost (and power) of GPUs, especially now, has (and will) likely keep them out of mainstream AVR/AVP units. Dedicated DSPs or FPGAs seem popular. Trinnov uses standard Intel processors, no specialized DSP AFAIK. Maybe there's a market for a high-end GPU-based sound processor? If you can sell boxes of dirt for better grounding...
 

dc655321

Major Contributor
Joined
Mar 4, 2018
Messages
1,597
Likes
2,235
What is the use of more taps beyond achieving sufficient low frequency resolution?

You need to have a heart-to-heart with Rob Watts and the Chord marketing department ;)

Is DSP a parallelizable problem? I had thought FPGAs and ASICs were the tools of choice for applying algorithms to sound.

Some aspects of DSP can benefit from parallelization, sure.
Partitioned convolution algorithms, SIMO/MIMO systems (e.g., line arrays), even fundamentals like FFT/IFFT calcs would benefit.

But for the typical audiophile needs, heavy parallelization is not a requirement.
 

sarumbear

Master Contributor
Forum Donor
Joined
Aug 15, 2020
Messages
7,604
Likes
7,324
Location
UK
The high cost (and power) of GPUs, especially now, has (and will) likely keep them out of mainstream AVR/AVP units.
Apple’s recent A series SoCs have pretty efficient GPUs that pack a punch. The M2 is even more powerful without much more power penalty.

Maybe someone can build an app that runs on an iPad Pro? USB-C should be enough to act as the I/O for even a large AVP. All you need to add is a multi-channel DAC. Or use a digital amp.

Your UI is sorted too…

AVP: There’s an app for that! :)
 
OP
T

TriN

Member
Joined
Jun 26, 2022
Messages
23
Likes
39
Experiencing signal degradation above 24k taps/channel; it could be due to the nature of pure FIR while working from "only" 2048 samples. Is the time domain too short?

Difficult to find answers, since no one really does pure FIR that high. DSPs are limited to 8192 taps (not FFT), which makes me wonder if this is because of performance issues or signal degradation above that point; likely both.

Max 8 is at capacity with 4096 "across all channels", so I don't think they ever considered FFT.
The amps running the JBL M2s are limited to 3500 taps (the source of the filter isn't available; I bet there is some IIR going on plus linear correction with FIR).

There is a FIR convolver from `gpu.audio` (it doesn't help us; that's for adding effects to tracks, and the source of each effect is a .wav a few seconds long).
They have a hybrid approach, "partitioned convolution + synthetic", a mix of pure FIR & FFT. Very interesting.
They also don't use OpenCL ("it's dead, no support from AMD"); they use CUDA and "HIP" to convert the code to something AMD can execute.

Regarding latency, it's not that bad: 40 to 50 ms (I was looking at the latest Epson projectors; they have video delays built in).

Do I need more taps? I mean, it sounds good at 4096... do I need more? Well, the GPU gives me a large number of channels to play with, so I don't have to break 4096 into pieces.
My goal is an active crossover with 2 filters per channel, 6 channels total, without the need for Linux, an RPi4, or maxing out the CPU... computers nowadays have GPUs, integrated or not, so let's try to use that power.

Anyone have experience with pure FIR and a large number of taps? I'm starting to believe that a greater number gives me much lower signal quality, so if FFT is the solution, I'll give it a try.
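If FFT is the route, the standard technique is partitioned (overlap-add) convolution, which is mathematically equivalent to direct FIR up to rounding while costing far less per sample. A minimal single-partition sketch in NumPy (illustrative names, not the gpu.audio implementation):

```python
import numpy as np

def overlap_add(x, h, block=2048):
    """Overlap-add FFT convolution of signal x with FIR h.

    Each input block is zero-padded, multiplied by the filter
    spectrum, and the tails are summed into the output, giving the
    same result as direct convolution up to rounding error.
    """
    # FFT size must hold one block plus the filter tail without wrap
    nfft = 1 << int(np.ceil(np.log2(block + len(h) - 1)))
    H = np.fft.rfft(h, nfft)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        conv = np.fft.irfft(np.fft.rfft(seg, nfft) * H, nfft)
        valid = len(seg) + len(h) - 1
        y[start:start + valid] += conv[:valid]
    return y
```

Real-time convolvers additionally partition the *filter* into blocks so latency stays at one block while the tap count grows, but the core idea is the same.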
 

DonH56

Master Contributor
Technical Expert
Forum Donor
Joined
Mar 15, 2016
Messages
7,901
Likes
16,716
Location
Monument, CO
Regarding latency, it's not that bad: 40 to 50 ms (I was looking at the latest Epson projectors; they have video delays built in).

Video delays? Nice, have not seen that, at least that I recall. Usually audio is delayed on the assumption that video processing takes longer. Maybe that is no longer true with the latest multichannel lossless codecs? I think the AVRs I have in-house have max audio delays around 50~100 ms (not sure, have not looked in a while, and that info tends to be hard to find) and no video delays.
 

dc655321

Major Contributor
Joined
Mar 4, 2018
Messages
1,597
Likes
2,235
Anyone having experience with pure FIR and large number of taps? I'm starting to believe that a greater number gives me a much lower signal quality so if FFT is the solution, I'll give it a try.

You keep using the phrase, “pure FIR”.
AFAIK, that’s not a thing.

So, what do you mean? Time-domain convolution (aka linear convolution)?
 

gnarly

Major Contributor
Joined
Jun 15, 2021
Messages
1,034
Likes
1,469
Anyone having experience with pure FIR and large number of taps? I'm starting to believe that a greater number gives me a much lower signal quality so if FFT is the solution, I'll give it a try.
Hi, let me echo dc655321's question... what do you mean by 'pure FIR'?
And what do you mean by lower signal quality? How determined?

I've run a number of FIR setups from 4k taps per channel, to 65k taps per channel. Mostly in linear phase mode.
My experience has been that sound quality improves with more taps until a sufficient level of frequency resolution occurs, and then more taps don't add anything.
I think proper implementation of FIR filters, particularly if they are linear phase, matters more to sound quality than the number of taps, except at low frequencies.

I'm currently using a Q-Sys Core 500i hardware processor configured to run 15 channels of 16k taps per channel @ 48kHz. (It's for a 5-way speaker in an LCR setup.)
Bought it used for $1200..... hardware does exist...
A 6-year-old RasPi worked fine with 8 channels of 65k taps @ 48kHz... I can't imagine how many taps a current PC CPU can run...

So I'm really having a hard time understanding why a GPU would be needed, either computationally or, as per my prior post, pragmatically in terms of how much FIR time is really needed.
Sorry, I'm not getting where you're coming from.....
 

AwesomeSauce2015

Active Member
Joined
Apr 14, 2022
Messages
205
Likes
195
Subscribed.
Not an expert on FIR filters, but I struggle to see a situation in which the hundreds of cores of a (small) modern GPU would be properly utilized.
I mean, most modern computer CPUs have at least 4 cores, with Intel's new i9-12900 having many more.
An RTX 3080 has many thousands of compute cores.

While I definitely think this is a cool idea which may work for numerous filters and/or fully correcting a large audio system, I just don't see the need per se for a GPU-based DSP engine over dedicated hardware, which would probably be preferred in a large and complex setup.
The only real benefit of a GPU over a CPU, in a dedicated box, is the higher-speed memory access (with high-end GPUs), and even then iGPUs don't get that. If you are trying to run this on top of Windows or some other OS, you'll probably have a lot of problems with latency due to process scheduling... not to mention the problem of getting the signal into the system.
 
OP
T

TriN

Member
Joined
Jun 26, 2022
Messages
23
Likes
39
Pure FIR, brute force... is a loop over the samples, then another loop inside for the taps, plus some shifting... that's the part that takes a lot of time. It's the most basic and accurate way.
2048 samples * x taps doesn't seem like a lot to compute... but at 192kHz those samples are coming in hot!

Then you have many other ways: FFT, cuFFT, FFTW, FFTS, FFTE, FFTW3 and dozens more, all based on the same idea but with modifications and workarounds to go faster. They all have a different output / error rate; some are single-precision float vs double. It doesn't have to be perfect; they are mostly used for 2D and 3D.

The GPU way explodes that loop(loop()) into 2 dimensions across ~8000 threads. Better than a couple, right?
It also gives me "atomic" functions to save the results of these thousands of threads into the output stream; accessing x while doing (x = x + a) across many threads doesn't add up, and atomic functions are there to help, giving an extra dimension, aka more speed.

By lower quality, I mean it feels like the signal is duplicating itself, stretching / rubber-banding; at 100k+ taps it becomes a pain to listen to.
Since the "pure FIR" high-end DSPs available on the market have more or less 8k taps, no one has really run a lot more taps to say... nah, that's stupid, don't do it.
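For reference, the loop-in-a-loop described above can be sketched in Python (a hypothetical `brute_force_fir`, not the OP's actual Max 8 / OpenCL code):

```python
import numpy as np

def brute_force_fir(x, h):
    """The 'pure FIR' double loop: outer over output samples,
    inner over taps. Exact but O(len(x) * len(h))."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(h)):
            if n - k >= 0:           # shift k samples back in time
                acc += h[k] * x[n - k]
        y[n] = acc
    return y
```

On a GPU, each (n, k) pair can be handed to its own thread, with the per-sample accumulation done via a reduction or the atomic adds mentioned above; that is the "explosion into 2 dimensions" in practice.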
 

dc655321

Major Contributor
Joined
Mar 4, 2018
Messages
1,597
Likes
2,235
Then you have many other ways: FFT, cuFFT, FFTW, FFTS, FFTE, FFTW3 and dozens more...

How about WTF? You forgot that one.
 

thorvat

Senior Member
Joined
Aug 9, 2021
Messages
323
Likes
387
My experience has been that sound quality improves with more taps until a sufficient level of frequency resolution occurs, and then more taps don't add anything.

Actually they do... They add more delay. :D
 