binaural audio finally free (as in freedom)

Hi everyone. I am currently building a system that lets people experience surround and immersive formats like Dolby Atmos over regular headphones. Instead of relying on typical "fake" virtual surround tricks, the idea is grounded in real acoustic measurements taken from professional control rooms, which should translate into a much more natural and believable listening experience. A key milestone has already been reached, as I now have a working pipeline that can decode Dolby Atmos content into a format the project can process reliably.

At this point, I am moving into a more experimental phase and would really benefit from some community involvement. I am developing and validating a method for capturing the precise data needed to achieve high quality binaural rendering. To do that properly, I am looking for people who can run a few guided tests on their own systems and share the resulting data.

Right now, the focus is on users who have access to binaural or spatial audio plugins such as APL Virtuoso V2, DeepStereo Monitor, NovoNotes 3DX, or any comparable tools. Support for other setups may come later as the process becomes more refined.

The setup requirements are fairly straightforward: a Windows machine, an up to date Python installation, and a basic level of comfort with following technical instructions. You do not need any background in DSP, acoustics, or mathematics. I will provide clear, step by step guidance to make sure everything is done correctly and consistently.

If this sounds like something you would be open to helping with, I would really appreciate your involvement.
 
What makes this less 'fake surround' than all the other 'simulators' ?

In the end it will always be a 'mix' of calculated signal combination (with or without any delays) that is still reproduced with 2 drivers next to the ear and does not make use of the acoustics of the room it is played back in.
 
There are three levels:

A. Algorithmic Virtual Surround (DSP):

These are standard, software-based spatializers that attempt to simulate a room environment without personalized anatomical data. Because they rely on generic phase and equalization manipulation, they frequently introduce artificial coloration and generally result in a subpar, unnatural listening experience.

B. Generic Binaural (Dummy Head or Averaged Datasets):

This approach utilizes HRTF profiles derived from acoustic dummy heads or aggregated human measurements. The effectiveness here depends heavily on how closely an individual's unique anatomy aligns with the dataset. When the correlation is strong, the spatial effect is highly convincing and immersive. This is exactly the tier I am currently focused on, specifically, gathering and evaluating these generalized datasets to identify which ones offer the best baseline fit, as they vary significantly.

C. Individualized HRTF Measurements:

This remains the undisputed gold standard for spatial audio. It involves measuring the exact acoustic profile of a user's own head and ears. However, the steep logistical barriers, typically requiring travel to a dedicated acoustic lab to capture a personalized binaural sweep, mean it is a highly exclusive experience that only a very small subset of users will ever have access to.

Binaural hearing can be described as a mapping from a three dimensional sound field to two ear signals, where spatial information is encoded in interaural time differences, interaural level differences, and direction dependent spectral filtering introduced by the head, torso, and pinnae; these effects are collectively modeled as head related transfer functions, so that reproducing the appropriately filtered signals at each ear via headphones recreates the same acoustic cues at the eardrum and allows the brain to infer the original sound direction.
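As a minimal sketch of that idea, the snippet below filters a mono signal with a left and a right head related impulse response (HRIR, the time-domain form of an HRTF) to obtain the two ear signals. The toy HRIRs are made up for illustration (a pure delay and attenuation at the far ear), not measured data, and `render_binaural` is a hypothetical helper name.

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono source with an HRIR pair; returns an (N, 2) array."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=1)

# Toy HRIRs (assumed values, NOT measured): source located to the left.
itd_samples = 30                      # interaural time difference in samples
hrir_left = np.zeros(64)
hrir_left[0] = 1.0                    # near ear: earlier and louder
hrir_right = np.zeros(64)
hrir_right[itd_samples] = 0.5         # far ear: later and quieter

click = np.zeros(128)
click[0] = 1.0                        # unit impulse as the test source
out = render_binaural(click, hrir_left, hrir_right)

# Interaural level difference in dB between the two ear signals.
ild_db = 20 * np.log10(np.max(np.abs(out[:, 0])) / np.max(np.abs(out[:, 1])))
```

With a real dataset, the two toy arrays would be replaced by the measured HRIR pair for the desired direction.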
 
The problem with HRTF measured on a specific headphone is that it will differ when measured using another headphone.
You can't 'correct' all headphones correctly using one single measurement of HRTF.

It is a combination of headphone and ear and can even differ between L and R ear.
Also the 'ability' to hear 'depth' with just 2 drivers on the side of the ear not only depends on the technical aspect but also on how the brain is 'wired'.
The final result thus is highly individual, not just acoustics but also the perception part.

The question thus remains: how many heads will correlate with a certain dataset, and how is that dataset obtained?
 
You make a valid point about transducer-to-ear coupling (HpTF) varying significantly between headphones, but modern DSP pipelines are actually designed to account for this by decoupling it from the anatomical HRTF. During the measurement phase, standard practice involves deconvolving the measurement headphone's acoustic signature from the initial binaural sweep to isolate your baseline HRTF. When switching to a new device, discrete Headphone Equalization (HpEQ) filters based on target compensation curves are applied to neutralize its specific frequency response. While you are absolutely right that minor variances in acoustic impedance and physical coupling remain at the eardrum, these deconvolution and EQ techniques are highly effective, allowing a single HRTF profile to translate very successfully across different hardware.
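To make the deconvolution step concrete, here is a rough sketch using regularized frequency-domain division. It assumes the headphone impulse response `hp_ir` is actually known for the ear in question, which is an idealization. The function names and the regularization constant `beta` are assumptions for illustration, not part of any specific tool.

```python
import numpy as np

def regularized_inverse(h, n_fft, beta=1e-3):
    """Frequency-domain inverse of h; beta limits the boost where |H| is small."""
    H = np.fft.rfft(h, n_fft)
    return np.conj(H) / (np.abs(H) ** 2 + beta)

def remove_headphone_response(measured, hp_ir, beta=1e-3):
    """Deconvolve an (assumed known) headphone response from a measurement."""
    n_fft = len(measured)
    M = np.fft.rfft(measured, n_fft)
    return np.fft.irfft(M * regularized_inverse(hp_ir, n_fft, beta), n_fft)

# Synthetic check: a made-up "HRIR" coloured by a made-up headphone response.
rng = np.random.default_rng(0)
hrir = rng.standard_normal(32) * np.exp(-np.arange(32) / 8.0)
hp_ir = np.array([1.0, 0.5, 0.25])
measured = np.convolve(hrir, hp_ir)   # what the microphone would capture
recovered = remove_headphone_response(measured, hp_ir, beta=1e-8)
```

In this synthetic case the first 32 samples of `recovered` match `hrir` closely. With real measurements the quality of the result depends entirely on how well `hp_ir` is known.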
 
During the measurement phase, standard practice involves deconvolving the measurement headphone's acoustic signature from the initial binaural sweep to isolate your baseline HRTF.
That, however, would only be accurate IF the response of the headphone were actually known, which it isn't, as that depends on the test fixture(s) used, production spread, seal and the target used.
Headphone measurements below 100Hz and above 3kHz are unreliable, despite what many owners/manufacturers of industry standard test fixtures would like to think.
Of course they ARE compliant to the standards they are designed for but unfortunately human ears mostly do not conform.
 
Ultimately, the true measure of spatial audio is practical efficacy, namely achieving a convincing end result for the listener. Despite the theoretical compromises you mentioned, the foundational science generalizes across a wide variety of anatomical profiles much better than one might expect. The reality is far less bleak than it appears on paper, a conclusion I can confidently support given that investigating the real-world performance of this is exactly what my current work focuses on. In any case, I am actively seeking individuals with access to the datasets mentioned in the OP. If you are in a position to share or contribute that data, I would welcome the collaboration.
 
Sadly I do have Windows running on an older PC (for certain needs) but the rest works on Linux.

The few demos I heard never convinced my brain there is a 3D image present, not even with binaural recordings.
Tried various crossfeeds, and heard demos of 'spatializers' and 'realizers', and while this does help with some recordings in making the sound less 'stereo', it can't persuade my brain that I should be hearing depth.
I reckon as soon as I put on headphones (and do that a lot) my brain knows it is coming from the sides.
I would be of little use in this endeavor I'm afraid. I am a hopeless case in hearing 3D, except in real life.

It took me a while to get used to and accept a headphone as an entirely different way of listening than speakers and kind of accepted that these are 2 so very different methods of listening.
 
Until someone is willing to contribute the necessary datasets, and I can explain the exact technical process to anyone interested in helping, I will continue building this project using publicly available research data. It will not be the same, but it is still a very, very meaningful improvement. That said, based on what you mentioned, it likely will not be useful to you personally.
 
You need to elaborate more on what 'data' you are looking for and what exactly is required from the helpers.
 
What are you trying to achieve here?
A. Algorithmic Virtual Surround (DSP):
Thousands of spaces have been measured, and even more simulated, in many formats ready to be used for virtual surround upmixing. The problem is that the audiophile community is not ready to accept that you need to use one if you want binauralization. They don't realize that if they put a speaker in a room, they are basically physically upmixing to surround with their room. Meanwhile, Apple sneaked 2 BRIRs into Spatial Audio without telling anyone.
B. Generic Binaural (Dummy Head or Averaged Datasets):
Also so many available, what is new here?

C. Individualized HRTF Measurements:
Yes, streamlining here would be great.

Here is my take. I don't need another script. We are knee-deep in scripts which nobody uses because of complexity.
What I'd like to have is an integrated solution that makes it easy and accessible to anybody. Like Apple Spatial Audio, but better. Too bad Apple messed it up at so many levels that it is not suitable for critical listening, but it is still the most widely used.

If I could cut and paste features, my ideal setup would be:
* Apple's head scanning to create a personalized SOFA file
* ANC headphones or earbuds using the built-in microphone for on-head measurements and corrections
* An APL Virtuoso-like standalone app or system extension to use the SOFA file and corrections systemwide
* Head tracking
* The possibility to pick your proper virtual environment freely within the app (not the ASPEN crap in APL). Just using any reverb will not do it. Tellingly, Audio Ease had Altiverb for multichannel studio work, and 360pan to create virtual spaces - they are not the same.
 
Here's the specific support I'm looking for:

* Someone who has access to immersive binaural audio rendering plugins in VST3 format, such as (but not limited to):

-- APL Virtuoso V2
-- DeepStereo Monitor
-- NovoNotes 3DX
-- Etc.

* Access to the Reaper DAW (or an equivalent alternative that has full and complete support for 16+ channel audio)

* Basic technical proficiency, including:

-- General computer usage (Windows or Mac)
-- Installing and working with Python scripts and related workflows

I would appreciate it if anybody has access to these tools and is also willing to donate some of their time towards this community project. If you have limited experience in DSP, your contribution will primarily be the extraction of binaural rendering data from these tools. The method of said extraction is one that I will provide to you. The nature of how to run it will also be provided to you, but you must have at least some understanding of what you are doing, for obvious reasons. If you have advanced experience in DSP, let me know as well.
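The exact extraction method is not spelled out in this post, so the following is only a hypothetical illustration of one common way to capture a renderer's per-channel responses: send a unit impulse through each input channel in its own time slot, record the stereo output, and cut it back into slots. It assumes the renderer behaves as a linear, time-invariant system (no head tracking or dynamic processing) and that each response is shorter than a slot. `fake_renderer` is a stand-in for a real plugin, and all names are assumptions.

```python
import numpy as np

def make_staggered_impulses(n_channels, slot_len):
    """One unit impulse per channel, each in its own time slot."""
    x = np.zeros((n_channels * slot_len, n_channels))
    for ch in range(n_channels):
        x[ch * slot_len, ch] = 1.0
    return x

def split_responses(stereo_out, n_channels, slot_len):
    """Cut the rendered stereo output into (channel, time, ear) responses."""
    return stereo_out[: n_channels * slot_len].reshape(n_channels, slot_len, 2)

def fake_renderer(x, brirs):
    """Stand-in for a plugin; brirs has shape (n_channels, taps, 2)."""
    n, n_ch = x.shape
    out = np.zeros((n + brirs.shape[1] - 1, 2))
    for ch in range(n_ch):
        for ear in range(2):
            out[:, ear] += np.convolve(x[:, ch], brirs[ch, :, ear])
    return out

# Demo with random "responses" standing in for a real plugin's behaviour.
rng = np.random.default_rng(1)
n_channels, slot_len, taps = 16, 256, 100
brirs = rng.standard_normal((n_channels, taps, 2))
test_signal = make_staggered_impulses(n_channels, slot_len)
rendered = fake_renderer(test_signal, brirs)
recovered = split_responses(rendered, n_channels, slot_len)
```

In practice the test signal would be exported as a multichannel file, rendered through the plugin in the DAW, and the recorded stereo result loaded back for the slicing step.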
 
ASPEN crap in APL
What is wrong with the virtual room in Virtuoso?
Genuinely interested as I know nothing about the internals of this kind of software.
And Virtuoso worked for me better than anything else I encountered so far.
 
Here is what I am trying to achieve:

1. Identify the highest quality dataset available for converting immersive audio into binaural audio for headphone playback:

In an ideal scenario, a user would obtain a personalized HRTF by visiting a university lab and having their ears measured. In practice, this is not realistic for most people. Because of that, a well-constructed generic dataset is needed, one that is built from high quality measurements and can perform well across a wide range of listeners. I am seeking community help to gather and contribute to this kind of dataset.

2. Develop a processing method that can interpret the contents of a Dolby Atmos file, including dynamic audio objects:

The goal is to extract and translate this information into an intermediate representation that can then be convolved into binaural audio.

3. Build a desktop tool that takes an input audio file and converts it into binaural output:

This includes support for Dolby Atmos, LPCM, and a limited subset of DTS formats. Immersive DTS formats and Auro-3D are not supported at this stage. The scope is intentionally limited due to development constraints and the fact that Dolby Atmos is currently the most widely used format in consumer scenarios.

4. Package the entire workflow into a user-friendly solution:

The end goal is to allow users to take arbitrary audio files and convert them into binaural audio for headphones with minimal effort.
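As a toy illustration of what an "intermediate representation" in step 2 could look like, the sketch below turns a moving audio object into feeds for two virtual speakers using constant-power panning. Each feed could then be convolved with the impulse response pair for its speaker position. This is an assumed simplification for illustration only, not how Dolby's renderer or this project actually works, and `pan_object` and the speaker angles are hypothetical.

```python
import numpy as np

def pan_object(mono, azimuth_deg, spk_left_deg=30.0, spk_right_deg=-30.0):
    """Constant-power pan of a moving object between two virtual speakers.

    azimuth_deg is a per-sample azimuth in degrees (positive = left).
    Returns an (N, 2) array of [left speaker feed, right speaker feed].
    """
    az = np.clip(azimuth_deg, spk_right_deg, spk_left_deg)
    frac = (spk_left_deg - az) / (spk_left_deg - spk_right_deg)  # 0 = left, 1 = right
    theta = frac * np.pi / 2
    return np.stack([mono * np.cos(theta), mono * np.sin(theta)], axis=1)

# An object sweeping from the left speaker to the right speaker.
mono = np.ones(5)
azimuth = np.linspace(30.0, -30.0, 5)
feeds = pan_object(mono, azimuth)
```

The squared gains always sum to one, so the object keeps the same total power while it moves.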
 
What is wrong with the virtual room in Virtuoso?
Here is the IR from APL listening room:

[Attachment: APL Listening room.jpg]

A couple of evenly spaced high passed early reflections, and a short reverb trail. Very artificial.
Here is a real control room at the same scale:
[Attachment: Control room.jpg]

ASPEN is better than nothing but in my experience it is far from good. Usually good enough for pop-jazz. For opera, classical it is inadequate for me.
Not to mention that even if the simulation would be better, the max room size of 10m x 10m x 10m is very small for a real performing venue.
 
You aren't looking at a music waveform; you're looking at an IR. This measures how a room reacts to a sudden, sharp sound like a handclap. Because the goal of a great listening space is to hear the speakers and not the room, an ideal IR is a single, sharp spike for the direct sound followed by absolute silence. (as you are aware)

In the real world, that initial spike is followed by room reflections. The APL Virtuoso graph, with its sparse lines, actually demonstrates a vastly superior acoustic environment. The sound energy drops off rapidly, which means acoustic treatments or software are successfully neutralizing unwanted echoes. This leads to precise, clear audio.

You are confusing an acoustic measurement with a music track. In an IR graph, a thicker waveform doesn't mean better sound or more complex music. It simply means more acoustic distortion. The APL Virtuoso graph isn't missing data at all. It's missing destructive room noise, which is exactly what you want.
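For anyone who wants to put numbers on how fast the energy in an IR drops off, rather than judging by eye from a plot, a common tool is the Schroeder energy decay curve. The sketch below computes it and estimates a reverberation time from a line fit between -5 dB and -25 dB. The test IR is synthetic (decaying noise with a chosen RT60 of 0.3 s), and the function names and fit range are assumptions for illustration.

```python
import numpy as np

def energy_decay_curve_db(ir):
    """Schroeder backward integration: remaining energy after each sample, in dB."""
    energy = np.cumsum(ir[::-1] ** 2)[::-1]
    return 10 * np.log10(np.maximum(energy / energy[0], 1e-30))

def rt60_from_edc(edc_db, fs, lo=-5.0, hi=-25.0):
    """Fit a line to the decay between lo and hi dB, extrapolate to -60 dB."""
    idx = np.where((edc_db <= lo) & (edc_db >= hi))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)   # slope in dB per second
    return -60.0 / slope

# Synthetic IR: noise whose amplitude falls by 60 dB in rt60_true seconds.
fs = 48000
rt60_true = 0.3
t = np.arange(fs) / fs
rng = np.random.default_rng(2)
ir = rng.standard_normal(fs) * 10 ** (-3 * t / rt60_true)

edc = energy_decay_curve_db(ir)
rt60_est = rt60_from_edc(edc, fs)
```

Running the same two functions on the exported IRs of the plugin and of a measured room would give comparable decay figures for both.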
 
A couple of evenly spaced high passed early reflections, and a short reverb trail. Very artificial.
Here is a real control room at the same scale:
How can a "couple" be evenly spaced?
The reverb in Virtuoso can be adjusted. And why is a short tail bad? Some control rooms (Nevell) make considerable effort to have little reflections/reverb.
A generic "real control room" certainly is not what I would be aiming for.
[Blackbird Studio C might be the exception, but that one is not the usual type of control room.]
Not to mention that even if the simulation would be better, the max room size of 10m x 10m x 10m is very small for a real performing venue.
Why would one want a room of the size of a concert hall in the simulation? This is where the recording was done. The mix is done in and aimed at a different kind of room.
For opera, classical it is inadequate for me.
What solution would be adequate in your opinion?
 
Let me explain this for anybody who is unfamiliar. There are two dominant approaches to headphone-based binaural audio processing:

Method #1: Free-Field Model

Free-Field HRTFs. The DSP convolves the dry audio with an impulse response measured inside an anechoic chamber, or processed to appear anechoic. This mathematically applies the ITD, ILD, and the specific frequency filtering caused by the human pinnae, head, and torso. It strictly excludes any environmental acoustic data or room boundaries.
In general, this gives nearly absolute phase coherence and spectral neutrality. Because there are zero simulated early reflections or reverberant tails, there is no destructive interference or comb filtering. The frequency response, timbre, and transient dynamics of the original source file remain completely uncolored and pristine.

Having said all that, it can sometimes suffer from poor externalization. The human auditory system heavily relies on room reflections to judge distance and spatial depth. Without these acoustic cues, the audio typically suffers from in-head localization.

Method #2: BRIRs

The DSP convolves the audio with a complex impulse response that captures both the HRTF and the specific Room Impulse Response (RIR) of a physical space. This response contains the direct sound, the early specular reflections bouncing off physical boundaries, and the dense, stochastic late reverberation field. Its decay is governed by the RT60 and the physical absorption coefficients of the modeled room.

There is usually superior externalization and real-world translation for mixing and mastering purposes. By feeding the brain the complex spatial and temporal cues it naturally expects in an enclosed space, the sound is perceived as coming from physical speakers in an actual room.

However, there is inevitable spectral coloration. Introducing early reflections and a reverberant tail guarantees some degree of phase shifting and comb filtering. The raw timbre of the source file is altered by the acoustic signature of the simulated room.
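The comb filtering mentioned for this method can be illustrated with the simplest possible case: the direct sound plus a single reflection. The sketch below evaluates the magnitude response of that two-path system. The 1 ms delay and 0.5 reflection gain are arbitrary example values, and `comb_response_db` is a hypothetical helper.

```python
import numpy as np

def comb_response_db(freqs_hz, delay_s, gain):
    """Magnitude in dB of direct sound plus one delayed, scaled reflection."""
    H = 1 + gain * np.exp(-2j * np.pi * freqs_hz * delay_s)
    return 20 * np.log10(np.abs(H))

delay_s = 1e-3   # reflection arrives 1 ms after the direct sound
gain = 0.5       # reflection at half the amplitude of the direct sound

freqs = np.array([500.0, 1000.0, 1500.0, 2000.0])
response = comb_response_db(freqs, delay_s, gain)
notch_db = response[0]   # 500 Hz: reflection arrives out of phase
peak_db = response[1]    # 1000 Hz: reflection arrives in phase
```

With these values the response alternates between a dip of about 6 dB and a boost of about 3.5 dB every 500 Hz. A real room adds many reflections with different delays, so the combined pattern is far denser.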
 
In practice, this is not realistic for most people. Because of that, a well-constructed generic dataset is needed, one that is built from high quality measurements and can perform well across a wide range of listeners. I am seeking community help to gather and contribute to this kind of dataset.
I do not think that is a solution.
The best average size of shoe will still not fit most people. And ears are more different than feet.
I think Virtuoso is doing it right: giving the possibility to choose an HRTF from a collection in the app, to search for the best personal fit among the available databases, or even to somehow (Genelec, labs, ...) get a personal .sofa created.

If only there were a streamlined solution to do what Apple does. Scan the head-ears with photos/video and create an HRTF from the geometry.
The end goal is to allow users to take arbitrary audio files and convert them into binaural audio for headphones with minimal effort.
So you want to build a competitor to the available ones that do that already?
Headtracking?

The frequency response, timbre, and transient dynamics of the original source file remain completely uncolored and pristine.
I would not call the sound reproduction in an anechoic chamber uncoloured or pristine. Maybe in a technical sense, but not in respect to perception.
To reproduce the mix in a (real or simulated) anechoic chamber will by all means not "sound right". Certainly not in stereo but probably not in multichannel either.
The sound in a concert hall is the result of comb filtering, but in the recording this is not the kind of comb filter that anyone in the hall will experience. It is a wild mixture from mics hanging in the air somewhere and most of the time combined with lots of other mics in other positions. All this is put together by hand based on the impression that another comb filter (the studio) produces for the sound engineer.
It will not sound like in the concert hall in any case, but without the studio reflections it will not sound like it is heard in the studio either. Chances are, few people will like it.
 