
DIY GPU DSP/DDC Box - Looking for opinions

michihito · Member · Joined: Dec 23, 2025 · Messages: 14 · Likes: 19 · Location: Japan
Hi ASR!

Real-Time GPU Power for Your Audio: Introducing a New GPU DSP/DDC Project

I’m thrilled to share a project I’ve been developing that brings massive GPU computing power into the real-time audio signal path.
The concept is a high-performance, GPU-based DSP/DDC engine designed to redefine what’s possible with PCM processing.

What is this project?
It is a real-time PCM processing solution that leverages the parallel computing power of a GPU (NVIDIA Jetson Orin Nano) to achieve audio quality that exceeds traditional hardware limitations.

Key Features:
  • Plug-and-Play Simplicity: Just connect via USB. It acts as a seamless bridge in your audio chain, transforming standard PCM into a highly refined output.
  • Extreme FIR Filtering: By utilizing the GPU, it runs ultra-long-tap FIR filters (640k taps) for upsampling and correction in real time: processing tasks that would be impossible for a standard CPU without significant latency.
  • Intuitive UI/UX: No command-line hassle. A clean Web UI allows you to explore different functions, toggle settings, and hear the results instantly with a few clicks.
  • Audiophile-Grade Processing: Dedicated to those who want the most precise reconstruction of their digital music through pure computational brute force.

Current Status: Listening Right Now
I’m happy to say that development is essentially complete. In fact, as I write this post, I am listening to music processed by this exact system. I’m currently running a massive FIR-based upsampling pipeline on the GPU, and the results are stunning. The transparency and ease of the sound delivered by those ultra-long taps are something you truly have to hear to believe.

I'm currently thinking about how to release this. As for the project itself, I'm confident I've written code that could become a de facto standard for high-end audio.
Any advice you can give me would be greatly appreciated.


 
Real-Time GPU-Based PCM Processing: 640k Tap FIR Filtering

The core of this GPU DSP/DDC project lies in its ability to execute massive FIR (Finite Impulse Response) filters in real-time with high precision. While traditional hardware DSPs often struggle with tap counts due to memory and clock cycle constraints, this system leverages the parallel processing architecture of NVIDIA GPUs to push the boundaries of digital reconstruction.

Technical Specifications
  • Filter Length: Up to 640,000 taps per channel.
  • Architecture: Two-stage FIR pipeline utilizing CUDA or Vulkan (VkFFT) for partitioned convolution.
  • Precision: High-precision floating-point arithmetic throughout the signal path.
  • Processing: Real-time upsampling and frequency-response correction with minimal audible latency (44.1 kHz → 705.6 kHz / 48 kHz → 768 kHz).

Stopband Rejection and Frequency Response
By utilizing 640k taps, we achieve an exceptionally steep transition band and deep stopband rejection, ensuring that aliasing artifacts are pushed far below the noise floor. The linear phase characteristics preserve the original timing of the recording, while the sheer computational power allows for a "brute force" approach to ideal sinc-function reconstruction.

Minimum phase (44.1 kHz):
  • passband_end_hz: 20000
  • passband_ripple_db: 1.5e-06
  • stopband_attenuation_db: 153.4
Linear phase (44.1 kHz):
  • passband_end_hz: 20000
  • passband_ripple_db: 3.1e-10
  • stopband_attenuation_db: 240.0

Why GPU?
The decision to move DSP tasks to the GPU was driven by the requirement for massive convolution. Standard CPUs, even with SIMD optimizations, face significant overhead when dealing with 640k taps across multiple channels. By using a partitioned convolution approach on the GPU, we achieve:
1. Parallelism: Simultaneous processing of thousands of filter coefficients.
2. Deterministic Latency: Stable real-time performance even at high sampling rates.
3. Headroom: Sufficient computational overhead to run additional neural-network-based processing (such as the integrated limiter) in parallel.
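For a sense of scale, here is a back-of-envelope FLOP estimate for uniform partitioned convolution. The cost model (roughly 5·N·log2(N) per FFT, 8 FLOPs per complex multiply-accumulate) is a standard textbook approximation, not a benchmark of the actual engine:

```python
import math

def upols_gflops(taps, fs, block, channels=2):
    """Back-of-envelope FLOP/s for uniform partitioned overlap-save convolution.
    Assumes ~5*N*log2(N) FLOPs per N-point FFT and 8 FLOPs per complex MAC;
    a rough cost model, not a measurement."""
    n = 2 * block                                # FFT size (overlap-save)
    parts = math.ceil(taps / block)              # number of filter partitions
    fft_flops = 2 * 5 * n * math.log2(n)         # forward + inverse FFT per block
    cmac_flops = parts * (block + 1) * 8         # spectral MACs over all partitions
    blocks_per_second = fs / block
    return channels * blocks_per_second * (fft_flops + cmac_flops) / 1e9

# 640k taps, stereo, 705.6 kHz output rate, 32k-sample partitions:
print(f"{upols_gflops(640_000, 705_600, 32_768):.2f} GFLOP/s")
```

Under this model the stereo 640k-tap load comes out below 1 GFLOP/s even at the 705.6 kHz output rate, which is consistent with the headroom claim above.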

This project demonstrates that modern computing hardware is no longer just for graphics—it is a superior platform for high-fidelity digital audio.
To be honest, it all started when I found Chord's M Scaler too expensive and decided to build my own.
 

High-Precision Parametric EQ via 640k Tap FIR Integration

This GPU-based DSP engine implements parametric EQ by converting traditional filter parameters into a massive FIR pipeline. Rather than using standard IIR filters, the system synthesizes a high-resolution frequency response and integrates it directly into the upsampling convolution stage.

FIR-Based EQ Implementation

To maintain high signal integrity and consistent phase behavior, the EQ is processed using a 640,000-tap FIR filter. By combining the EQ curve with the upsampling filter's impulse response, the system performs both tasks in a single high-performance convolution pass on the GPU.

  • 640k Tap Precision: The long filter length allows for extremely granular control over the frequency magnitude, minimizing the artifacts often found in lower-tap FIR or traditional IIR implementations.
  • Integrated Pipeline: The EQ profile is convolved with the reconstruction filter, ensuring that the entire DSP chain remains computationally efficient and deterministic.
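Since convolution is associative, the single-pass claim is easy to demonstrate in a few lines of NumPy. The filters below are random stand-ins, not the real coefficients:

```python
import numpy as np

# Convolving the two impulse responses once, offline, yields one FIR that
# behaves exactly like the EQ and reconstruction filters run in series.
rng = np.random.default_rng(0)
upsample_fir = rng.standard_normal(501)   # stand-in reconstruction filter
eq_fir = rng.standard_normal(201)         # stand-in synthesized EQ curve
combined = np.convolve(upsample_fir, eq_fir)

x = rng.standard_normal(4096)
serial = np.convolve(np.convolve(x, upsample_fir), eq_fir)  # two passes
single = np.convolve(x, combined)                           # one pass
assert np.allclose(serial, single)
```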

OPRA Integration and Auto-Correction​

The system features native integration with the OPRA (Open Reference Audio) database. Users can search for and download correction profiles for hundreds of headphone models directly through the Web UI.



Modern Target Fitting: Beyond Harman​

While the Harman Target remains a benchmark, the latest research from industry leaders such as Dan Clark and Dr. Sean Olive suggests specific refinements for a more natural sound. This project is prepared for these advanced requirements:

  • Target Transformation: The engine is designed to adapt standard Harman-based profiles to the latest research targets. It achieves this by precisely synthesizing the delta—typically requiring two specific parametric bands—and merging them into the final FIR coefficient set.
  • Extensibility: While this specific transformation is a highlight, the architecture is not limited to a set number of bands. The FIR synthesis engine can accommodate an arbitrary number of parametric filters, allowing for highly complex correction curves without any additional computational penalty during real-time playback.
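As a sketch of how a parametric "delta" could be baked into a linear-phase FIR via frequency sampling: the band shape below (`peaking_db`, a Gaussian bump on a log-frequency axis) and its center frequencies and gains are invented for illustration, not the actual target-transformation bands:

```python
import numpy as np

N = 4096                     # illustrative FIR length (the system uses 640k)
fs = 44100.0
freqs = np.fft.rfftfreq(N, 1 / fs)

def peaking_db(f, f0, gain_db, width_oct):
    """Hypothetical parametric band: a Gaussian bump on a log-frequency axis."""
    lf = np.log2(np.maximum(f, 1.0) / f0)
    return gain_db * np.exp(-0.5 * (lf / width_oct) ** 2)

# Example two-band "delta" between targets, merged into one magnitude curve.
delta_db = peaking_db(freqs, 105.0, 1.8, 0.9) + peaking_db(freqs, 3000.0, -1.2, 1.1)
mag = 10 ** (delta_db / 20)

# Frequency-sampling synthesis: zero-phase IR, rolled by N/2 for linear phase.
fir = np.roll(np.fft.irfft(mag, N), N // 2)

# The rolled FIR reproduces the target magnitude exactly (the roll only
# contributes linear phase), which is why adding more bands costs nothing
# at playback time: they are all merged into the same coefficient set.
assert np.allclose(np.abs(np.fft.rfft(fir)), mag, atol=1e-10)
```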
Sample_EQ_response.png

Real-Time Flexibility​

Despite the heavy computation required for 640k taps, the GPU (CUDA/Vulkan) handles the processing with ease. Users can adjust EQ parameters or swap profiles through the dashboard and hear the changes in real-time without audio dropouts.
 

Real-Time HRTF-Based Binaural Rendering via HUTUBS​

The spatial processing module in this GPU DSP engine implements binaural rendering for headphone listening by utilizing measured Head-Related Transfer Functions (HRTF). The primary objective is to translate the soundstage intended for loudspeakers to a headphone environment while minimizing the "in-head" localization common in stereo headphone reproduction.

Technical Implementation: The "Dry" Processing Philosophy​

Unlike many consumer-grade virtual surround processors, this implementation intentionally avoids the use of artificial reverberation or simulated room reflections.

  • Cue Preservation: The system focuses exclusively on the fundamental binaural cues: Interaural Time Differences (ITD) and Interaural Level Differences (ILD).
  • Maintaining Transparency: By processing the signal "dry," we avoid the temporal smearing and harmonic coloration typically introduced by recursive reverb algorithms. This ensures that the original recording's ambient information and transient response remain as close to the source as possible, simply re-mapped into a more natural spatial orientation.

Dataset: The HUTUBS HRTF Database​

The engine utilizes the HUTUBS (Head-Related Transfer Functions of the Technical University of Berlin and the University of Southampton) database.
  • Measurement Accuracy: This dataset consists of high-resolution Head-Related Impulse Responses (HRIR) measured under rigorous laboratory conditions.
  • Anatomical Matching: Since the effectiveness of an HRTF is highly dependent on the listener's head and pinna morphology, the system allows users to select from various subjects within the database. This empirical approach to subject selection provides a more accurate spatial fit than a single generic "average" profile.

Unified FIR Pipeline Integration​

The HRTF filter coefficients are synthesized and integrated into the system's broader FIR processing chain.

  • Convolution Process: The HRTF response is convolved with the 16x upsampling and parametric EQ filters to create a single, unified impulse response. This allows for complex spatial filtering without increasing the number of serial processing stages, thereby maintaining the signal-to-noise ratio and preventing cumulative rounding errors.
  • Phase and Magnitude Integrity: The GPU handles the partitioned convolution of these high-resolution filters in real-time, ensuring stable group delay and amplitude characteristics across the audible spectrum.
By prioritizing measured scientific data over heuristic acoustic modeling, this project provides a transparent and stable binaural rendering solution for critical listening applications.
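To illustrate the "dry" ITD/ILD-only approach, here is a toy NumPy sketch. The HRIRs below are invented impulses that encode only a delay and a level difference; the real engine uses measured HUTUBS HRIRs:

```python
import numpy as np

def binaural_render(xL, xR, hLL, hLR, hRL, hRR):
    """Two virtual loudspeakers rendered to two ears via four HRIR convolutions.
    hSE = impulse response from speaker S to ear E (toy stand-ins here)."""
    yL = np.convolve(xL, hLL) + np.convolve(xR, hRL)
    yR = np.convolve(xL, hLR) + np.convolve(xR, hRR)
    return yL, yR

# Toy HRIRs encoding only the two preserved cues: ITD (a ~0.6 ms delay,
# 26 samples at 44.1 kHz) and ILD (a level drop) on the contralateral path.
ipsi = np.zeros(64);   ipsi[0] = 1.0
contra = np.zeros(64); contra[26] = 0.45
yL, yR = binaural_render(np.ones(1), np.zeros(1), ipsi, contra, contra, ipsi)

# A left-only source arrives at the left ear immediately and at the right
# ear delayed and attenuated -- no reverb, no added coloration.
assert np.argmax(np.abs(yL)) == 0 and np.argmax(np.abs(yR)) == 26
```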

 

Achieving 640k Taps with Partitioned FFT and Low Latency for Video​

I wanted to share more details about the core of the signal processing engine: the Partitioned FFT Convolution.

The primary challenge was to handle ultra-long FIR filters (specifically targeting 640k taps) without the multi-second latency typical of standard FFT convolution. To solve this, I have implemented a Uniform Partitioned Convolution algorithm leveraging GPU parallelization.

1. The Strategy: Partitioned FFT​

Instead of performing a single massive FFT that matches the filter length, the impulse response is divided into smaller partitions (e.g., 32k samples).

  • Latency Benefit: The input-to-output latency is determined by the size of the partition, not the total number of taps.
  • GPU Efficiency: By using CUDA/Vulkan, we can process thousands of these partitioned blocks in parallel. This allows the system to compute a 640,000-tap filter with the latency equivalent to a much shorter filter.
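A minimal NumPy sketch of uniform partitioned overlap-save with a frequency-domain delay line, which is the scheme described above. The block size here is tiny for illustration (the real system uses ~32k-sample partitions on the GPU):

```python
import numpy as np

def partitioned_convolve(x, h, B=64):
    """Uniform partitioned overlap-save convolution with a frequency-domain
    delay line (FDL). Latency is one block (B samples), independent of len(h)."""
    P = -(-len(h) // B)                                   # number of partitions
    H = np.stack([np.fft.rfft(h[p*B:(p+1)*B], 2*B) for p in range(P)])
    fdl = np.zeros((P, B + 1), dtype=complex)             # spectra of past blocks
    n_blocks = -(-len(x) // B)
    x = np.pad(x, (0, n_blocks*B - len(x)))
    prev = np.zeros(B)
    out = np.empty(n_blocks * B)
    for b in range(n_blocks):
        cur = x[b*B:(b+1)*B]
        fdl = np.roll(fdl, 1, axis=0)
        fdl[0] = np.fft.rfft(np.concatenate([prev, cur])) # sliding 2B window
        y = np.fft.irfft((fdl * H).sum(axis=0))
        out[b*B:(b+1)*B] = y[B:]                          # overlap-save: keep last B
        prev = cur
    return out

# Check against direct convolution:
rng = np.random.default_rng(0)
x, h = rng.standard_normal(1000), rng.standard_normal(300)
assert np.allclose(partitioned_convolve(x, h, B=64), np.convolve(x, h)[:1024])
```

On a GPU, the per-partition spectral multiplies (the `fdl * H` line) are the part that parallelizes trivially across partitions, bins, and channels.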

2. Real-World Performance & Video Sync​

The current implementation has reached a level where it is perfectly viable for video consumption.

  • The latency is low enough that lip-sync issues are non-existent when watching movies or YouTube.
  • Even with 640k taps, the GPU handles the workload with massive headroom, ensuring zero dropped samples even at high sample rates.

3. Hardware Chain and Latency Constraints​

The current setup uses a Raspberry Pi as a UAC (USB Audio Class) bridge. The Pi acts as the audio input device, capturing PCM data and forwarding it to the GPU machine for processing.

While this configuration is excellent for high-fidelity listening and video, there is a minor trade-off:

  • Gaming Use-Case: Because of the overhead introduced by the USB-to-CPU transfer on the Raspberry Pi and the subsequent bridge to the GPU, "ultra-low" latency required for competitive gaming (like FPS) is currently difficult to guarantee.
  • Audio/Video Use-Case: For everything else, the latency is a non-issue.

I forgot to mention that the framework for a web app accessible from a smartphone is complete. It is built around an API, so you can choose whatever front-end design you like, and it will be simple to extend when implementing hardware in the future. It supports multiple languages via i18n, and if AI translation is acceptable, it can handle any number of languages.
 

Current Progress, Stability, and Modern Architecture​

I would like to share an update on the current state of my GPU-based DSP project. The system is now fully functional in my environment, and I’ve been able to verify its stability and performance over extended listening sessions.

1. Current Functional Status​

The processing engine is now operating as intended. It is strictly a PCM-based system; DSD support is currently outside the scope of this project, as the primary goal is high-precision FIR filtering and upsampling of standard high-resolution PCM streams.

2. Automatic Rate Tracking and 700kHz Upsampling​

I have implemented automatic tracking of the input sample rate. The system recognizes the sample-rate family—whether it belongs to the 44.1 kHz or 48 kHz base—and adjusts the pipeline accordingly. The goal is to upsample the final output to the 700 kHz class (705.6 kHz or 768 kHz). While the core logic is coded and running, I consider this part to be in "extended testing," as I am still fine-tuning the transition smoothness across different filters.
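The family detection itself is simple modular arithmetic. A sketch (function name hypothetical, mirroring the logic described above):

```python
def plan_upsampling(rate_hz):
    """Map an input rate to the ~700 kHz-class target of its base family
    and the integer upsampling ratio needed to reach it."""
    if rate_hz % 44100 == 0:
        target = 705_600      # 44.1 kHz family
    elif rate_hz % 48000 == 0:
        target = 768_000      # 48 kHz family
    else:
        raise ValueError(f"unsupported sample rate: {rate_hz}")
    return target // rate_hz, target

assert plan_upsampling(44_100) == (16, 705_600)
assert plan_upsampling(96_000) == (8, 768_000)
```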

3. System Stability and Robustness​

One of the most satisfying results so far is the stability of the Jetson environment. Clipping noise, XRUNs, and buffer overflows/underflows—common headaches in DIY Linux audio—have been effectively eliminated on the Jetson side. The backend remains rock solid. The only occasional latency spikes I’ve encountered trace back to power-supply drops on the Raspberry Pi (UAC bridge) side, rather than to the DSP processing itself.

4. "Self-Healing" Design​

To make this a true "headless" appliance, I have implemented an auto-recovery system:

  • Docker-based Architecture: The entire DSP stack runs in Docker containers with automatic restart policies.
  • Resilience: If the USB connection is interrupted or the power is forced off, the system automatically restores the audio pipeline upon recovery without manual intervention.

5. A Modern Architectural Approach​

What makes this project unique compared to traditional audio devices is the underlying architecture. I am utilizing a modern software stack typically seen in high-performance web or cloud systems (containerization, message queues for control planes, etc.).

By decoupling the UAC input (Raspberry Pi), the UI/Control plane, and the heavy-duty processing (GPU), the system achieves a level of flexibility and power that is quite a departure from conventional fixed-function DSP hardware. It’s been an interesting journey applying these "web-scale" patterns to the world of ultra-low-latency audio.
 

Future Roadmap: Vulkan Migration, Waveform Restoration, and Public Release Plans​

I’d like to close this update by sharing my future roadmap and some of the more "experimental" features I’m currently developing.

1. "Loudness Care": Restoring What Was Lost​

As a side project, I am working on a "Loudness Care" feature—an algorithm designed to reconstruct waveforms that have been crushed by the "Loudness War" (de-clipping and peak restoration).
  • The Challenge: While the theory is sound and works perfectly in offline processing, implementing it in a real-time, low-latency environment is significantly more difficult. It's a work in progress, but I’m determined to make it part of the pipeline.

2. The Move from CUDA to Vulkan: Universal GPU DSP

In parallel with the feature development, I am migrating the entire compute backend to Vulkan.
  • Portability: This move is strategic. Once completed, it will allow the DSP engine to run not just on NVIDIA hardware, but also on the GPU chips inside modern smartphones and other non-NVIDIA SoCs.

3. Free Community Version for Raspberry Pi 4/5​

If the Vulkan migration is successful, I plan to release a simple Pi-Audio DDC image for free.
  • Target: Raspberry Pi 4 or 5.
  • Spec: A simplified version with tap counts optimized for Pi GPUs (roughly 100k to 200k taps).
  • Features: It will include at least the EQ functionality.
    • You can run network player applications (Volumio, etc.) at the same time, because the DSP runs on the GPU.
  • Timeline: I am targeting January 2026 for this release.

4. Full Version for Jetson Orin Nano​

My current "reference" hardware is the Jetson Orin Nano. If there are individuals or companies interested in testing the "Full Version" (640k+ taps, full pipeline), I am open to sharing the image (not the code).
  • A Note on Hardware: Be aware that using a Pi as a UAC bridge requires a fairly complex setup, including I2S conversion and specific kernel configurations. Integrating a dedicated network audio module might be a cleaner path, though that brings its own set of complexities.

5. Final Thoughts: A "Series A" Tech Demo?​

To be honest, this project has evolved beyond what is typically considered "DIY." I view it more as a tech demo—the kind a Silicon Valley stealth startup would build to secure Series A funding. It’s an exploration of what happens when you apply modern, high-scale software architecture to the extreme constraints of high-end audio.

I am working hard to get this into a distributable state as soon as possible. If you find this interesting or would like to try it out, please let me know—I would love to hear your feedback and ideas!

Finally, this is a personal project that I originally started in order to get the best sound from my Sony MDR-Z1R. It was really just a hobby, begun simply because I wanted good sound. Once I started, I had fun, and now it looks like this.
 

Interesting project!
My only issue would be the noise of the GPU fan. Can you hear the fan during low-volume sessions?
Would the GPU be able to compute 640k taps fanless?
 
Nice, what data type do you use in this case? Is float not too limiting?
 
> Nice, what data type do you use in this case? Is float not too limiting?
Now, I use float32 because the Jetson has a heavy penalty for double precision.
This is why it's 640k taps, not 2M taps.

I have a build option for float64 (CUDA), but it doesn't run yet; it still has some bugs.
 
Pretty cool. I always wondered about partitioned FIR on GPUs.

How many channels of 640k taps can the Nvidia hardware handle? At what bit depth are the calculations done? I assume FP64? Given 512-sample partitions and 96 kHz sampling, you'll need an estimated 2 GFLOPS of FP64 to handle a stereo pair. That means the GPU could handle about 6 of those channels (~7 GFLOPS FP64 in total, estimated). Larger partitions (more delay) would yield a lower load, so more channels.

As for the claimed miracle sound quality… I highly doubt that. Next you’ll claim to hear things 300 dB down in the noise floor…

Ah, edit: FP32… hmm I don’t think that is good enough, most of your 640k taps will be wasted on rounding errors. Somewhere in the CamillaDSP threads over at DIYAudio there is some info on this. It surely isn’t the end of the world. In vast majority of cases it won’t be audible anyway (but neither will the 640k taps, unless you really need the frequency resolution).
 
> FP32… hmm I don’t think that is good enough, most of your 640k taps will be wasted on rounding errors. Somewhere in the CamillaDSP threads over at DIYAudio there is some info on this.


That's right. Float32 limits the stopband to about -160 dB.
Right now, my priority is making it work perfectly, so I'm using float32 (via cuFFT).
Even if the FP64 penalty is 1/64 at 1 TFLOPS (Jetson Orin Nano), it should still be capable of 15 GFLOPS, so in theory I think it can go even further.
I do have that code, but it's still a bit buggy and no sound comes out.
For now, I'm prioritizing development of other features.
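The float32 stopband limit is easy to demonstrate: design a very deep filter in float64, cast the coefficients to float32, and measure what survives. A sketch with a smaller tap count than the real system:

```python
import numpy as np
from scipy import signal

# Design a deep lowpass in float64 (Kaiser beta 25 targets roughly 235 dB
# of stopband attenuation), then cast the taps to float32 and re-measure.
taps64 = signal.firwin(4001, 0.25, window=("kaiser", 25.0))
taps32 = taps64.astype(np.float32).astype(np.float64)

def stopband_atten_db(taps, stop_edge=0.30):    # stop_edge: fraction of Nyquist
    w, h = signal.freqz(taps, worN=1 << 16)
    mag = np.abs(h[w / np.pi > stop_edge])
    return -20 * np.log10(mag.max())

a64 = stopband_atten_db(taps64)
a32 = stopband_atten_db(taps32)
print(f"float64: {a64:.0f} dB, float32: {a32:.0f} dB")
```

The float32 version flattens out around the -150 dB region regardless of the design target, because coefficient quantization noise sets the floor.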
 
> Even if the FP64 penalty is 1/64 at 1 TFLOPS (Jetson Orin Nano), it should still be capable of 15 GFLOPS, so in theory I think it can go even further.
It will do about 225 GFLOPS in FP32. FP64 is about 1/32 of that, sadly.

Stay at FP32 for now ;)

This stuff should run on any GPU, so a fully fledged PC GPU should handle this without issues as well. You could even add one to the Pi5 :)
 
> It will do about 225 GFLOPS in FP32. FP64 is about 1/32 of that, sadly.
>
> Stay at FP32 for now ;)
>
> This stuff should run on any GPU, so a fully fledged PC GPU should handle this without issues as well. You could even add one to the Pi5 :)
Thank you for your advice.
I will continue development with FP32 for the time being.
In reality, even FP32 is not flagship-grade, but I think it will be sufficient.
I believe the tap count matters more for the time domain than for the stopband ("black background").
I have confirmed that 2M taps in FP32 works, but I am prioritizing reliable operation and have settled on around 640k.
 
Help me.

I'm a beginner when it comes to ASR. I posted this content in the DIY forum, but it might not have been the right place.
I'd appreciate advice on what to do, such as rewriting it somewhere else or linking to it from elsewhere.
 
Don’t worry too much :)

These kinds of topics are kind of niche, so don’t expect too much traction initially.

If you want your topic moved, contact one of the mods and ask for guidance. You can also just report yourself via the Report button on every post, and ask to move the topic. These messages will also go to the mods.

Also note that any commercial enterprise posting on the forum must follow some additional rules. Since your status isn't yet very clear, the mods may give you some slack.
 
Update: AI-based "Loudness Care" (De-Limiter) Integration and Synergy with Minimum Phase FIR

I would like to share the latest progress on my DIY GPU-based DSP project. The most recent milestone is the successful integration of an AI-powered "Loudness Care" feature, leveraging the De-limiter project (jeonchangbin49/De-limiter).

1. Implementation and Streaming Design:
The implementation utilizes ONNX Runtime with a CUDA execution provider to handle real-time inference. Since the model requires significant computational resources, I have adopted a high-latency streaming architecture:
  • 6-second Latency: Audio is processed in fixed 6.0-second chunks to accommodate the reconstruction overhead.
  • Seamless Transitions: To prevent boundary artifacts, we use a 0.25-second overlap with a raised-cosine crossfade window. This ensures the output is perfectly continuous without clicks.
  • Stability and Fallback: The system includes a fallback mechanism that automatically bypasses processing for a specific chunk if an inference error occurs, maintaining the audio timeline.
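The raised-cosine crossfade has a useful property: the complementary fades sum to one, so wherever neighbouring chunks agree on the overlap region, the stitched output reproduces the signal exactly. A sketch with hypothetical chunk and overlap sizes:

```python
import numpy as np

def stitch(chunks, overlap):
    """Concatenate processed chunks, crossfading `overlap` samples at each
    boundary with complementary raised-cosine fades (fade_in + fade_out == 1)."""
    fade_in = 0.5 * (1 - np.cos(np.pi * np.arange(overlap) / overlap))
    fade_out = 1.0 - fade_in
    out = chunks[0].copy()
    for c in chunks[1:]:
        out[-overlap:] = out[-overlap:] * fade_out + c[:overlap] * fade_in
        out = np.concatenate([out, c[overlap:]])
    return out

# Sanity check: identical overlapping chunks stitch back to the original,
# which is why the fallback (bypassing one chunk) stays click-free.
x = np.sin(np.linspace(0, 20, 2000))
ov = 250
assert np.allclose(stitch([x[:1000], x[1000 - ov:]], ov), x, atol=1e-12)
```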
2. Synergy with Minimum Phase FIR Filters:
A particularly interesting subjective result is the synergy between this AI De-Limiter and the massive minimum phase FIR filters used in the upsampling stage. Minimum phase filters are ideal for eliminating pre-ringing; however, because they concentrate energy at the very beginning of the impulse response, they can sometimes make highly compressed "loudness war" recordings feel subjectively "sticky" or as if the sound is "pressed against the ear."

The De-Limiter reconstruction significantly alleviates this sensation. By intelligently restoring dynamic peaks and redistributing the "cluttered" energy of the source material, it removes that aggressive pressure. The result is a more relaxed, spacious presentation that preserves the transient integrity of the minimum phase filtering while making loud tracks far more listenable.

3. Performance and Metrics:
Running on an NVIDIA Jetson platform, the efficiency is excellent:
  • Throughput: about 50x real-time speed for 44.1kHz audio.
  • Real-Time Factor (RTF): 0.072.
  • Resource Usage: Average CPU load is ~576% (multithreaded) and GPU load is ~27% during active inference.
Objective Measurement Results:
Objective analysis using loud source material shows clear improvements in dynamic range metrics:
  • Peak Level: Reduced from -0.19 dBFS to -5.34 dBFS.
  • PLR (Peak-to-Loudness Ratio): Improved from 9.41 to 10.69.
  • Crest Factor: Increased from 13.89 dB to 14.70 dB.
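For reference, crest factor is straightforward to compute; a short sketch showing why limiting lowers it (and why de-limiting should raise it):

```python
import numpy as np

def crest_factor_db(x):
    """Crest factor: peak level over RMS level, in dB."""
    return 20 * np.log10(np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2)))

# A clean sine has a crest factor of ~3.01 dB; hard-limiting it flattens the
# peaks while keeping most of the RMS energy, so the crest factor drops.
t = np.arange(44100) / 44100
sine = np.sin(2 * np.pi * 440 * t)
limited = np.clip(sine, -0.5, 0.5)
print(round(crest_factor_db(sine), 2), round(crest_factor_db(limited), 2))
```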
delimiter_comparison_nori.png

delimiter_waveform_multiscale_nori.png
 

A Glimpse into the Future (6–12 Months Ahead): The "1-to-1 Audio" Proof of Concept​

Project Overview​

Happy new year, ASR!

I began this project on November 22nd. My core objective was a "proof of concept" to see what a single individual can achieve when partnering with AI.
Everything I wrote in this forum is the result of just one month of development.
By profession, I am an SRE (Site Reliability Engineer) for a Japanese SaaS provider. I am not a traditional audio hardware engineer.
This project is not a business; it is just a hobby.
However, using AI as a force multiplier, I was able to implement complex features like Loudness Care in practically no time—development started on December 20th, and it was functional within a week.

Screenshot 2025-12-31 22.28.04.png

What This Experiment Signifies​

1. The Democratization of Complex Audio Puzzles​

Historically, implementing a system like this required a massive convergence of specialized resources:
  • Substantial corporate capital.
  • Experts in GPU programming (CUDA/Vulkan).
  • Deep knowledge of Audio Theory and DSP.
  • Mastery of the Linux audio layer (ALSA/PipeWire).
  • High-level API and application layer expertise.
  • DevOps and CI/CD proficiency.
  • Professional Project Management.
In the past, only a few "supermen" or high-end manufacturers with million-dollar budgets (like Sony) could play in this niche.
Yet, as a relative amateur in this specific field, I have succeeded.
I don't claim to have superior talent; I am simply leveraging AI effectively. What I can do today, everyone will be able to do in 6 to 12 months. The evolution of AI is moving that fast.

2. Software-Defined Audio (SDA)​

The speed of implementation is the story here. Because I am primarily a headphone user, I wanted HRTF and specific Target Curve functions—so I built them. If I were a speaker user, I would have prioritized Room Correction.

We are entering an era where if a feature is theoretically possible in software, AI can help manifest it almost instantly.

  • Tube simulators? Probably possible in a few weeks.
  • Vocal isolation/extraction? Probably possible in a few weeks.
The "Camera" Moment: Just as photography shifted from competing on lens/sensor physics to "Computational Photography" (filters and neural networks), audio is undergoing a shift toward Computational Audio.

3. The Collapse of Mass Marketing & the Rise of "1-to-1 Audio"​

In the traditional mass-market model, developing high-end correction features for niche headphones was a business risk. Unless a product appeals to a wide audience, the ROI isn't there.
But if a specialized feature can be developed in one or two days, the risk disappears.

We can now create "Individualized Audio"—sound that is hyper-optimized for your specific system and your ears—without caring about "market appeal."
The theme for the next year or two will be moving away from mass marketing and toward building long-tail platforms. This experiment is my way of exploring how we build that world.

--------
I'd like to think a little more about what to do with this project.
I'd like to wait a little longer to see if I decide to make it public.

Either way, if there is anyone who would like to think about this kind of future together, discuss it, or even get involved in something, please get in touch.
 
Hello.
Thanks for sharing this very interesting project of yours.
For the past three to five years I have wondered whether someone would make this type of software for the GPU, like Audiovero (Acourate) and similar, but nobody had done it.
And how much more efficient it would be, considering how much computing power GPUs have.
And now your project is here. Even though you are mainly aimed at headphones, I think it's interesting to read about it, see your progression, and see how fast it has really gone for you.
Cool project, and thanks for sharing.
And I wish you a good new year as well.
 
Hi ASR!

I've made the repository public.
The license is a bit unusual, and it's not strictly open source.
You're free to use it for personal use.

There's also Vulkan code, but it hasn't been fully tested.
If you want to, I think it can be used not only on the Jetson but also on gaming PCs and the like.
You're free to tinker with the code and try it out, and if there's anything you'd like me to make, please get in touch.

 