
A thread for binaural virtualization users.

Someone on Reddit used very tiny MEMS microphones (AMM-2738-B-R) for measuring the frequency response near the eardrum; unfortunately, there is not much space to solder the contacts.

Any ideas? Do we need an appropriate controller board from the manufacturer?

Here are some additional results regarding the differences caused by the insertion depth.

Small MEMS mics could be the next step to address that problem, and also help with the required personal headphone equalisation.
 
I don't think there's any point in paying huge amounts of money for those systems:
- They rely on converting 3D geometry into HRTFs, which has its own degree of inaccuracy.
- They rely on average frequency response curves for headphones, which are more or less accurate depending on the unit-to-unit variation of the specific headphone model. Also, your own ears can differ from the averaged ones of the measurement rig used to get those frequency responses in the first place.

I think only Sony is doing the right thing, by physically recording your own personal HRTF:


However, you're limited by their 360 audio format, and they only do this in Tokyo, New York and Los Angeles. Quite a bummer :confused:

This measurement is now available in London, UK and also Nashville in the US.

But it relates to their professional *studio* virtualization software, Sony 360 Virtual Mix Environment: they essentially record the room the measurement is made in, with mics down your ear canals, so that the room can be recreated virtually.


So, you get the option to have the measurements done in the studio doing the testing - HHB's studio in London, with a Genelec Dolby Atmos system - or you can pay more for them to travel so the measurements are made at a studio of your choice (go hire a Dolby Atmos demo space or a speaker manufacturer's showcase studio if you don't have one or know anyone who does - ideally the best one you know of!).

Then you pay a yearly subscription for the virtualization software. It's standalone software, so you could set up a suitable mini PC, NUC or Mac mini exclusively for the purpose and work out how to get the audio or files over to it.

In the UK, it's £100 for the measurement at their place and £250/yr for the software licence.

The only negative bit is that it's intended for use with Sony MDR-MV1 headphones .. but I would have thought you could ask to have the measurements done with a specific pair of headphones you already own?

The headphones become part of the calibration with this system .. I believe measurements made without headphones, playing audio all around with the surround system, happen first, then measurements are taken with the headphones on second.

The way of recreating the first via the second is worked out ... and I think a transferable HRTF + headphone calibration is also extracted from the data, so you can then work with other virtual rooms .. including rooms created in CAD, apparently, if all acoustic properties of materials are in that CAD data. So you could then virtualize spaces sonically as an architect ..

Its coming to London is a pretty exciting thing in this virtualization area ..
 
The last few months I've put in a lot of work towards binaural virtualization. My first attempt was in 2021, when I contacted a few local anechoic chamber facilities. Each one told me they generally don't make them available to civilians. Nothing happened then because of a lack of interest on their side and high costs (thousands) even when they were willing. Also, this all-real approach was not the right one in any case, since I learned a little later that virtualization using anechoic in-ear measurements is limited to the exact position of the speaker. Many separate and exact measurements have to be made, which is not feasible if the facility doesn't already have a setup specifically for HRTF capture.

After seeing this thread and @fcserei's successful and fairly low cost attempt, I decided to try scanning. I don't have a scanner and scanning myself seemed like too much of a pain, particularly since, if you have ever seen the raw output of handheld scanners, a lot of garbage polygons and surface weirdness are generated and have to be cleaned up. If you take many different scans to fully cover your head and torso, those individual scans have to be precisely spliced together for a complete model. Overlapping geometry, holes and twisted membranes will cause problems and have to be avoided.

I contacted an engineering firm that scans patients for prosthetics and they agreed to help. The price was $600 USD.

I was scanned in multiple passes using an Artec Spider II for my ears and Artec Leo for my torso and head. The Artec software, with a little guidance by the engineer in charge, automatically splices together the partial models generated by each pass. It specifically uses texture cues, since the scanning resolution is in the micron range. This capability is particularly helpful because I did not have to be clamped, sit in a vice, or stick tracking dots all over, all of which are common practice.

The session took two hours in a small room with a photographer's diffuse lights. I sat in a chair and was told to move as little as possible while scans were in progress. General advice was to dress in a skin tight white shirt, shave and wear a head cap. No branding or logos should be visible. Scanners in general are bad at dark and reflective surfaces, and the unevenness that comes with clothing and hair. During the session, the engineer also sprayed dust in my ears (while I had ear plugs in) to help the scanner pick up fine curves and eliminate shine. It had an especially hard time detecting the undersides of the helix (outer rim of the ear) and antihelical fold (one of the ridges in the middle), and the retroauricular sulcus (the back side of the ear and its connective tissue). He also stuck masking tape on the back of my head to flatten the hair coming out of the cap.

The raw scan compilation was filled with thousands of floating segments of visual garbage, I would guess as a consequence of the extremely high resolution of the scanners. It also included parts of the chair I sat in, the small ridges of the swim cap I wore, the folds and neckline of my unfortunately too-loose shirt, parts of what facial hair I had, and the added confusion of earplugs being in for some passes and not others. All in all I think the software was attempting to stitch together over a dozen different passes whose coverage overlapped and which featured some small holes despite the engineer's best efforts. He would spend time processing these after I left.

I received a completed model a few weeks later. It did not have defects, but it was incredibly detailed and would need manual work before running through mesh2hrtf. At the outset the mesh count was 400k triangles.

When I started I had no experience with 3D editing software. I used a combination of Meshmixer and Blender. They are without question hard to use. Thankfully Meshmixer has a more or less intuitive adaptive smoothing function and a strong set of analytical tools for finding defects or visually displaying unusually oriented vertices. With some work I reduced the fine features of my face and torso, including rounding the flat surface sealing my truncated shoulders and chest; the models must be watertight (also called manifold) for the HRTF simulation.
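A watertightness check can also be scripted rather than eyeballed. A minimal sketch with the trimesh Python library (not part of mesh2hrtf; the file name is a placeholder):

```python
# Sketch: check that a mesh is watertight before feeding it to mesh2hrtf.
import trimesh

mesh = trimesh.load("model.stl")             # placeholder path
print("watertight:", mesh.is_watertight)     # True means no holes
print("triangles:", len(mesh.faces))

if not mesh.is_watertight:
    # fill_holes() only handles simple holes; complex defects still
    # need manual cleanup in Meshmixer or Blender.
    trimesh.repair.fill_holes(mesh)
    print("after fill_holes:", mesh.is_watertight)
    mesh.export("model_filled.stl")
```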

Depending on the editing software, repairing holes and smoothing can produce insane geometry. Blender by default creates complex polygonal faces that best fit the selection; these then have to be subdivided into triangles. Meshmixer doesn't have this issue and defaults to triangles, but you have less fine control, and where the mesh has sharp discontinuities or edges the results are not predictable.
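If you're doing the triangulation in Blender, it can be scripted in one go. A minimal sketch, equivalent to Mesh > Face > Triangulate Faces, assuming the model is the active object:

```python
# Sketch: triangulate every face of the active object in Blender so the
# mesh contains only triangles (what mesh2hrtf expects).
import bpy

bpy.ops.object.mode_set(mode="EDIT")
bpy.ops.mesh.select_all(action="SELECT")
# "BEAUTY" picks splits that avoid long, thin slivers where possible
bpy.ops.mesh.quads_convert_to_tris(quad_method="BEAUTY", ngon_method="BEAUTY")
bpy.ops.object.mode_set(mode="OBJECT")
```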

My first attempt to run the HRTF calculation script failed because I did not sufficiently simplify the model. At 200k triangles it refused to run.

I reduced the triangle count to around 100k after a few days of working in Meshmixer. The ears I barely touched apart from smoothing the artificially sculpted blocked canal, since the scanner wasn't able to capture much deeper than the concha (the grooved area leading to the canal, where IEMs usually sit). I then ran the mesh2hrtf optimization script, which produced individual left and right models of around 40k triangles each.

I tried calculating again but ran into a major issue. For the highest frequencies, a very large amount of RAM is required—more than the 32 GB I had installed. I am decent with computers, but I don't know how to troubleshoot situations like this. The script would not recognize virtual RAM allocated on my SSD.

I took a break from messing with the model and found a local audiologist that offered ear scans using the Otoscan system instead of the usual injection molds for custom earplugs and hearing aids. I convinced her to make the scans and send me the files. These were high-resolution scans of my canals to a depth of 20 mm, which is fine; normal adult ear canals are 25-30 mm long. They cost around $70 USD.

I asked the engineering firm for help splicing these with my earlier head and torso scan, and they agreed. Here I should add that I suspected everyone who had reported poor virtualization results either used too low a scanning resolution or, because the scans had no ear canals, had the misfortune of needing more of the personalized high-frequency information necessary for externalization (hearing sounds whose apparent position is outside of your head) and especially frontal localization. Others, for unclear reasons, did not have problems with clear frontal localization. There is evidently some level of luck involved.

I also eventually purchased and installed 64 GB of RAM. My computer is a passively cooled Windows 11 machine hiding in a low cabinet in my living room. A small, silent Noctua fan forces air out of a slotted opening on its rear side. The case and power supply are made by Streacom, the motherboard is an ASRock Z690M-ITX/ax and the CPU is an Intel i3-13100T with four cores. Without modern thermal throttling I'm sure the whole thing would have melted. Three to eight calculation instances were running at a time, consuming around 90% of RAM and CPU on average.

The calculation took around 52 hours to complete, but failed to produce usable results in the end due to what are called nonconvergence issues, where the calculation is unable to find a solution for the HRTF at a particular frequency.

These issues are caused by problems with the model geometry. To prevent them, mesh2hrtf has a mesh optimization script, but this script was clearly not meant for models with ear canals included. Normally the chosen ear is left more or less untouched while the mesh of the rest of the head is simplified, especially the opposite-side ear, which disappears completely apart from a few bumpy artefacts. In this case, the script generated a twisted version of the opposite-side ear canal. I think it tried to reduce it to nothing and close the hole, but couldn't because the geometry reached deeper than it expected. Manual work in Blender helped here.

I'm not completely sure what to do next. I think the nonconvergence issues, which started only at 15 kHz and up, are to do with the geometry of the ear canals specifically. The furthest extremities are a little jagged and I left them that way because working with geometry positioned inside a larger model is a struggle. I also didn't want to arbitrarily introduce new shapes, but that was probably the wrong weighting of priorities.

While Meshmixer has the better smoothing function, it does not react well to deeply concave surfaces. It will take more manual effort with Blender for a good result.

Whenever I finally get all this working I will compare the tradeoffs of head-only vs. head-and-torso simulations, headtracking vs. none, and visual input vs. none. The first is a matter of accuracy, while the others apparently strongly determine the believability of auditory scenes. If you know the sound will not match the space you're in, that cognitive dissonance is supposed to be enough to ruin auditory processing and perceptual stability.

 
Great Post! Thank you for all this information.
This is what I would call a truly great effort.
But maybe you are overdoing it in terms of resolution (microns?).
20 kHz is a 17 mm wavelength, and that frequency range does not even make a big impact anyway, so millimetre resolution (at the pinnae) should do more than OK (especially if the mesh is smooth).
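For concreteness, the arithmetic, plus the common BEM rule of thumb of at least six elements per wavelength:

```python
# lambda = c / f at the top of the audible range
c = 343.0          # speed of sound in air, m/s (room temperature)
f = 20_000.0       # Hz
wavelength = c / f                 # ~0.0172 m, i.e. ~17 mm
max_edge = wavelength / 6          # common BEM guideline: >= 6 elements/wavelength
print(f"{wavelength*1e3:.1f} mm wavelength, ~{max_edge*1e3:.1f} mm max edge length")
```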
Whenever I finally get all this working I will compare the tradeoffs of head-only vs. head-and-torso simulations, headtracking vs. none, and visual input vs. none. The first is a matter of accuracy, while the others apparently strongly determine the believability of auditory scenes. If you know the sound will not match the space you're in, that cognitive dissonance is supposed to be enough to ruin auditory processing and perceptual stability.
I cannot comment on the enormous task of all these technicalities. But about head tracking, I can say that it makes a huge difference in realism and believability, even though I do not move my head much or often. It seems to me as if the brain “learns” from the head movements in order to make (spatial) sense of the signal.
For cognitive dissonance from the visual input I use a simple solution: I close my eyes - this is something I do in live concerts too.

 
Great writeup.

If you have convergence issues only at high frequencies, you can do the following to hear what you achieved so far:
Copy the last good simulation result in the output directory to all the higher frequency slots above the failure.
Also edit the log files, duplicating the last successful frequency's log into the failed slots and modifying the step frequency in sequence.
This will trick the merge script into producing an output, which will be correct up to the failure point, with a 6 dB/octave downward slope after that. Pick the lower failure frequency of the two ears and apply the fix to both ears above it, otherwise it can introduce ringing at high frequencies.
I have used this to preview my results when I was too impatient to wait for the full simulation.
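A minimal sketch of the duplication step (the be.out/be.<step> layout matches my project; adjust the names for your Mesh2HRTF version):

```python
# Sketch: copy the last converged frequency step's output into all later
# (failed) slots so the merge script runs to completion.
import shutil
from pathlib import Path

out_dir = Path("NumCalc/source_1/be.out")   # assumption: per-source output dir
last_good = 120                              # last step that converged
total_steps = 150                            # steps in the full run

src = out_dir / f"be.{last_good}"
for step in range(last_good + 1, total_steps + 1):
    dst = out_dir / f"be.{step}"
    if dst.exists():
        shutil.rmtree(dst)                   # drop the failed/partial output
    shutil.copytree(src, dst)
# The per-step frequency in the NC*.out logs still has to be edited by hand
# (or similarly scripted) so the merge script labels each slot correctly.
```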
 
But maybe you are overdoing it in terms of resolution (microns?).
20 kHz is a 17 mm wavelength, and that frequency range does not even make a big impact anyway, so millimetre resolution (at the pinnae) should do OK (especially if the mesh is smooth).
That's exactly right. I had originally instructed the engineer to use the resolutions indicated in the table below, from a relevant study: https://publications.rwth-aachen.de/record/793260 But he used the native resolution of the scanners instead. He did reduce the complexity in the finished model, and the smoothing and mesh optimization steps I took reduced it even further.

[Screenshot: table of recommended scan resolutions from the study]
 
Whenever I finally get all this working I will compare the tradeoffs of head-only vs. head-and-torso simulations, headtracking vs. none, and visual input vs. none. The first is a matter of accuracy, while the others apparently strongly determine the believability of auditory scenes. If you know the sound will not match the space you're in, that cognitive dissonance is supposed to be enough to ruin auditory processing and perceptual stability.
I am a bit more pragmatic than this. I do not believe that utmost accuracy is the key to a realistic illusion. The shape of my head changes slightly during the day, my head position relative to the torso changes whenever I turn my head, my clothes change, the speed of sound changes with temperature, etc. None of that prevents me from hearing the 3D soundscape around me.

Instead of chasing arbitrary preference curves, perfect frequency response etc., in my opinion it is much more important to create as many auditory cues properly (not perfectly) as possible.

(My pet peeve is two-speaker stereo reproduction in a small room, which gets none of them right.)

With binauralization I can use a decent personal HRTF with the important HF dips, and with headtracking the image is more naturally outside of my head. Free of the problems of the small listening room, I can also create a compelling acoustic environment with all the reflected sounds closer to real directionality, timing, intensity and tonality.
All of these contribute to a realistic soundscape.

Of course visual cues still take precedence over acoustical ones, so a low-stimulation visual environment or darkness always helps. It is easier to imagine that I am in a concert hall if I am not looking at a brightly lit garden.

Here is my "take" on binauralization. It is EQd for AirPods Pro2 and my personal DFEQ, but should be reasonably transferable on Harman like headphones. The track is witching back and forth between the regular stereo and the processed signal a couple times with dropouts, because latency issues. No head tracking in this, so it is harder to assess the distance of the musicans.


A longer one, just straight processed. Listen till the end; a lot of things are happening:

 
Curious to know what your experiences are with IEMs vs. headphones for virtualization. In my experience, IEMs don't sound as realistic as open-back headphones, even though they sound better frequency-response-wise.

Also, how are you all EQing your listening devices for virtualization? Using an arbitrary target curve or doing something more?
 
I've compared the results of the mesh2HRTF calculations of my head scans to the Apple Personalized Spatial Audio head scan results.
You can scan your head with an iPhone in 30 seconds to create a personalized profile for Apple Spatial Audio, and you can use the results across all of your Apple devices to binauralize stereo or multichannel content.
Below are the results for the standard 5.1 directions, measured at the left ear.

[Plots: left-ear responses for the L, R, C, BL and BR directions]

The two very different methods track each other surprisingly well: the basic shapes are the same, and the important HF dips are at virtually the same frequencies.
So mesh2HRTF and the Apple scan produced virtually the same results.
In listening, the mesh2hrtf and Apple solutions produce the same spatial accuracy for me. If I switch to the generic Apple profile, the tonality changes and directions shift, mostly in elevation.

The Apple curves are smoothed. From the unsmoothed curve (attached) it seems the Apple renderer also applies the BRIR of a relatively small room on top of the HRTF. I attribute most of the raggedness and level differences in the 100 Hz-1 kHz range to this.
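For anyone who wants to repeat the comparison, pulling one direction's magnitude response out of the mesh2HRTF SOFA file takes only a few lines with the sofar package (the file name and direction are placeholders):

```python
# Sketch: magnitude response at the 5.1 front-left direction (az 30°, el 0°)
# for the left ear, read from a Mesh2HRTF SOFA file via the sofar package.
import numpy as np
import sofar

sofa = sofar.read_sofa("HRIR_44100.sofa")            # placeholder file name
pos = sofa.SourcePosition                             # (M, 3): azimuth, elevation, r
idx = int(np.argmin((pos[:, 0] - 30.0) ** 2 + (pos[:, 1] - 0.0) ** 2))

ir = np.asarray(sofa.Data_IR)[idx, 0]                 # receiver 0 = left ear
fs = float(np.atleast_1d(sofa.Data_SamplingRate)[0])
mag_db = 20 * np.log10(np.abs(np.fft.rfft(ir)) + 1e-12)
freqs = np.fft.rfftfreq(ir.size, 1 / fs)
for f_hz, m in zip(freqs[::32], mag_db[::32]):        # coarse text printout
    print(f"{f_hz:8.0f} Hz  {m:6.1f} dB")
```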

[Plot: left-ear response for the L direction in Movie mode]

There is also a MOVIE mode when you connect your headphones to an Apple TV, but that messes everything up: bass boost, an HF mess and more room interaction.

There are three files in the OS, Reverb_General.ir, Reverb_General_Personalized.ir and BRIR_GEneral_Personalized.ir, which I suppose control these added BRIR curves. I am working on figuring out their format, bypassing them or replacing them with flat responses, and checking the results afterwards.
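As a first step I am just peeking at the headers. A minimal sketch (the file name is one of the three above; everything about the interpretation is guesswork at this point):

```python
# Sketch: inspect the header of one of the Apple .ir files to guess the
# container format. Pure reverse-engineering guesswork, not a known API.
import struct

path = "Reverb_General_Personalized.ir"   # one of the files named above
with open(path, "rb") as f:
    header = f.read(64)

print(header[:16].hex(" "))   # raw bytes; a magic number often sits here
print(header[:16])            # printable characters hint at the container
# b"caff" would mean a Core Audio Format file (openable with afconvert);
# failing that, try reading part of the header as little-endian floats:
print(struct.unpack("<8f", header[32:64]))
```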

So, in conclusion, the 30-second Apple iPhone scan produces the same result as a full mesh2hrtf calculation, but it is hamstrung by the usual Apple lockdown. If only they were more flexible about letting us pick BRIRs. But they don't even let us easily EQ headphones.
 

Attachments

  • L unsmoothed.jpg
I am a bit more pragmatic than this. I do not believe that utmost accuracy is the key to a realistic illusion. The shape of my head changes slightly during the day, my head position relative to the torso changes whenever I turn my head, my clothes change, the speed of sound changes with temperature, etc. None of that prevents me from hearing the 3D soundscape around me.

Instead of chasing arbitrary preference curves, perfect frequency response etc., in my opinion it is much more important to create as many auditory cues properly (not perfectly) as possible.

(My pet peeve is two-speaker stereo reproduction in a small room, which gets none of them right.)

With binauralization I can use a decent personal HRTF with the important HF dips, and with headtracking the image is more naturally outside of my head. Free of the problems of the small listening room, I can also create a compelling acoustic environment with all the reflected sounds closer to real directionality, timing, intensity and tonality.
All of these contribute to a realistic soundscape.

Of course visual cues still take precedence over acoustical ones, so a low-stimulation visual environment or darkness always helps. It is easier to imagine that I am in a concert hall if I am not looking at a brightly lit garden.

Here is my "take" on binauralization. It is EQd for AirPods Pro2 and my personal DFEQ, but should be reasonably transferable on Harman like headphones. The track is witching back and forth between the regular stereo and the processed signal a couple times with dropouts, because latency issues. No head tracking in this, so it is harder to assess the distance of the musicans.


A longer one, just straight processed. Listen till the end; a lot of things are happening:

According to Sony's research (for their immersive virtualization software, used in professional environments, finalised and pushed into a product when the COVID lockdowns hit), more accurately personalised HRTFs are essential for more accurate perception in the vertical plane .. but even then, accuracy falls off above and below 30° (with 0° being straight out, perpendicular to your face). They also found it increased the accuracy of sounds from behind you (which can often be perceived as being in front if the HRTF isn't a good match).
 
Hi everyone, I’ve been following this thread and experimenting with Mesh2HRTF on and off for a few years. Like many of you, I became determined to overcome the dozens of friction points in the process—from "dreaded" non-convergence errors to the general file management headaches.

I ended up scripting my way out of most of the issues, which led to me building a Python-based orchestrator app. It essentially acts as a step-by-step wizard for the entire workflow.

I wanted to give a high-level overview of what I’ve got working so far (I will post a detailed breakdown later):
  • Scanning: I’m using a Creality CR-Scan Otter, which handles hair much better than the iPhone apps I tried previously.
  • Mesh Prep Automation: I made a tool that auto-aligns/centers the mesh, then optimizes it for the grading tool using PyMeshLab's "isotropic explicit remeshing" filter (see the sketch after this list). This creates very uniform triangles that the Mesh2HRTF grading tool handles well.
  • NumCalc Stability: As @Curvature has noted, the simulation phase is looong and can be risky. I recompiled NumCalc to stop immediately upon critical non-convergence errors, and my app includes an option to test-run the highest frequency first, saving potentially wasted hours.
  • Cloud Speed: On my local rig, a sim takes ~12 hours. I’ve successfully offloaded NumCalc to a 32-core Amazon EC2 instance, which cuts the run time down to under 2 hours. I'd be happy to share how I got this working in case anyone is interested.
  • Outputs: The app auto-generates 44.1 and 48 kHz SOFA files, including versions with Diffuse-Field equalization applied (crucial for plugins like APL Virtuoso that don't do it internally).
  • Diffuse Field Response: After many failed attempts, I was able to add a function to generate Diffuse Field responses from the SOFA files (the "Generate Extras" step), including the option to apply a dB per octave tilt. This is useful for equalizing headphones, but you need binaural measurement mics to use it.
  • Head Tracking: I couldn't get facemeshtracker working on my laptop, so I made a Python bridge app that converts AITrack output into the specific formats needed for Sparta Binauraliser or APL Virtuoso. I haven't tried converting the SOFA files for HeSuVi.
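For reference, the core of that mesh-prep step boils down to one PyMeshLab filter call, roughly like this (the values shown are starting points, not my tuned settings):

```python
# Sketch of the remeshing step: produce uniform, near-equilateral triangles
# with PyMeshLab's isotropic explicit remeshing filter.
import pymeshlab

ms = pymeshlab.MeshSet()
ms.load_new_mesh("head_scan.ply")                    # placeholder file name
ms.meshing_isotropic_explicit_remeshing(
    iterations=5,
    # target edge length as a percentage of the bounding-box diagonal;
    # older pymeshlab releases call this class Percentage instead
    targetlen=pymeshlab.PercentageValue(0.5),
)
ms.save_current_mesh("head_remeshed.ply")
print(ms.current_mesh().face_number(), "triangles after remeshing")
```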
Let me know if anyone is interested in learning more about the orchestrator app or the OSC bridge app or wants to help test either of them out.

 
I've compared the results of the mesh2HRTF calculations of my head scans to the Apple Personalized Spatial Audio head scan results.
You can scan your head with an iPhone in 30 seconds to create a personalized profile for Apple Spatial Audio, and you can use the results across all of your Apple devices to binauralize stereo or multichannel content.
Below are the results for the standard 5.1 directions, measured at the left ear.

...
This is super interesting. I really enjoy my own AirPods Pro (especially for watching movies), but I'm curious how you captured measurements for each of the channels like this?
 
but I'm curious how you captured measurements for each of the channels like this?
In Apple Logic Pro, if you set up a multichannel project, there is a Spatial Audio Monitoring plugin selectable for multichannel buses in the Imaging group. It uses the system renderer for monitoring.
Because the system renderer is available as a plugin, the output can be redirected anywhere. So you can use the headphone rendering for other headphones too, not just the compatible Apple ones, or you can use the built-in speakers' crosstalk cancellation and virtualization on external speakers, or you can redirect it to a measurement app with a virtual audio driver. Logic Pro also cuts the latency of the AirPods over Bluetooth from ~250 ms down to 60 ms, so head tracking is not laggy.
Logic Pro also has the Space Designer reverb, which can do real 3D reverb with B-format impulse responses. It can be used to overwhelm the Apple renderer's built-in, relatively mild BRIRs, so you can transpose your music to different environments.
 

Let me know if anyone is interested in learning more about the orchestrator app
I'd be happy to try. I did the simulation on an 8 GB Mac; it took me five days. Nice to see some progress in the usability of mesh2hrtf. Although seeing how close the results of Apple's 30-second scans are ...
Anyway, the lack of resources in my case led to a couple of pragmatic workarounds. When the simulation finishes the first half of the run, up to around 15 kHz (which usually takes less than a day), I can duplicate the last completed step's data output to all the higher slots, edit the frequency in the headers and run the finalize step. This way I can preview the result without waiting another four days for the run to complete. The result is usually good enough. Below, the black line shows the difference between the completed and the truncated results.
[Plot: completed vs. truncated results; the black line shows the difference]

In case of convergence errors, instead of hunting for those few wrongly sized or oriented triangles in the model, you can use this preview to see (and hear) whether your model is worth fixing.
 
Ah, that makes sense re the spatial audio measurements. It's simply not possible to do that on a Windows PC. I'm thinking of getting a Mac so that I have more options with respect to personalized Atmos with SOFA files, vs. the zero options on Windows.

If you want to do a quick test to see if your NumCalc run will fail, open a terminal in your project's "source_1" folder and run `NumCalc.exe -istart 140 -iend 140`, replacing "140" with whichever is the highest instance based on your project settings. You can view the .out file to see if it caused problems.
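If you'd rather script that check, a minimal sketch (the log file name and the warning wording are assumptions; see what your NumCalc build actually writes):

```python
# Sketch: test-run only the highest-frequency NumCalc instance, then scan
# its log for signs of non-convergence before committing to a full run.
import subprocess
from pathlib import Path

highest = 140   # highest instance per your project settings
subprocess.run(
    ["NumCalc.exe", "-istart", str(highest), "-iend", str(highest)],
    cwd="NumCalc/source_1",
    check=True,
)

# Assumption: the run writes an NC<start>-<end>.out log in the same folder.
log = Path("NumCalc/source_1") / f"NC{highest}-{highest}.out"
text = log.read_text(errors="ignore").lower()
if "iterations" in text and "reached" in text:
    print("possible non-convergence; inspect the log")
else:
    print("highest-frequency step looks OK")
```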

That's a neat trick you came up with to cope with having only 8 GB of RAM. As you go up in frequency, the simulation requires more resources per frequency.

The two main things that cause problems and inefficiency with NumCalc are (1) non-equilateral triangles and (2) abrupt, rather than smooth, gradations from small to large triangles. On top of that, frequencies above about 16 kHz take exponentially more time and resources to process, are very likely to fail, and provide zero sonic benefit.
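To put a number on point (1) for your own mesh, here's a quick sketch with numpy and trimesh (the 0.3 threshold is arbitrary):

```python
# Sketch: flag far-from-equilateral triangles. Quality is
# 4*sqrt(3)*area / sum(edge_length^2): 1.0 for equilateral, ~0 for slivers.
import numpy as np
import trimesh

mesh = trimesh.load("head_remeshed.ply")     # placeholder file name
tri = mesh.triangles                          # (n, 3, 3) vertex coordinates
edges = np.roll(tri, -1, axis=1) - tri        # three edge vectors per face
edge_sq_sum = (edges ** 2).sum(axis=2).sum(axis=1)
quality = 4 * np.sqrt(3) * mesh.area_faces / edge_sq_sum

bad = quality < 0.3
print(f"{bad.sum()} of {quality.size} triangles are badly shaped")
```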

I'll review my scripts a bit more this week and then I can make them available for testing. I can make some adjustments so that it only goes up to 15 or 16 kHz for you, and then it can resample the results so that you don't have to do all that manual file hacking!
 
Out of interest, do these calculated (as opposed to measured) HRTFs account for torso and shoulder reflections?
 