Over the last few months I've put a lot of work into binaural virtualization. My first attempt was in 2021, when I contacted a few local anechoic chamber facilities. Each one told me they generally don't make them available to civilians. Nothing happened then because of a lack of interest on their side and high costs (thousands of dollars) even when they were willing. Also, this all-real approach was not the right one in any case, since I learned a little later that virtualization using anechoic in-ear measurements only works for the exact speaker positions that were measured. Many separate and exact measurements have to be made, which is not feasible if the facility doesn't already have a setup specifically for HRTF capture.
After seeing this thread and @fcserei's successful and fairly low-cost attempt, I decided to try scanning. I don't have a scanner and scanning myself seemed like too much of a pain, particularly since, if you have ever seen the raw output of handheld scanners, a lot of garbage polygons and surface weirdness are generated and have to be cleaned up. If you take many different scans to fully cover your head and torso, those individual scans have to be precisely spliced together for a complete model. Overlapping geometry, holes and twisted membranes will cause problems and have to be avoided.
I contacted an engineering firm that scans patients for prosthetics and they agreed to help. The price was $600 USD.
I was scanned in multiple passes using an Artec Spider II for my ears and an Artec Leo for my torso and head. The Artec software, with a little guidance from the engineer in charge, automatically splices together the partial models generated by each pass. It specifically uses texture cues, since the scanning resolution is in the micron range. This capability is particularly helpful because I did not have to be clamped, sit in a vice, or stick tracking dots all over, all of which are common practice.
The session took two hours in a small room with diffuse photography lights. I sat in a chair and was told to move as little as possible while scans were in progress. The general advice was to dress in a skin-tight white shirt, shave, and wear a head cap, with no branding or logos visible. Scanners in general are bad at dark and reflective surfaces and at the unevenness that comes with clothing and hair. During the session, the engineer also sprayed dust in my ears (while I had earplugs in) to help the scanner pick up fine curves and eliminate shine. It had an especially hard time detecting the undersides of the helix (the outer rim of the ear) and antihelical fold (one of the ridges in the middle), and the retroauricular sulcus (the back side of the ear and its connective tissue). He also stuck masking tape on the back of my head to flatten the hair coming out of the cap.
The raw scan compilation was filled with thousands of floating segments of visual garbage, which I would guess is a consequence of the extremely high resolution of the scanners. It also included parts of the chair I sat in, the small ridges of the swim cap I wore, the folds and neckline of my unfortunately too-loose shirt, parts of what facial hair I had, and the added confusion of earplugs in some passes but not others. All in all, I think the software was attempting to stitch together over a dozen different passes whose coverage overlapped and which featured some small holes despite the engineer's best efforts. He would spend time processing these after I left.
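Just to illustrate what cleaning out those floating fragments can look like if you script it yourself (this is not the firm's Artec workflow; trimesh, the file names and the size threshold are my own placeholders), something like this keeps only the large connected components:

```python
# Illustrative only: drop disconnected "floating garbage" from a raw scan
# export using trimesh. File names and the 500-face cutoff are placeholders.
import trimesh

mesh = trimesh.load("raw_scan.stl")

# Split into connected components, including ones that are not watertight.
parts = mesh.split(only_watertight=False)

# Keep only reasonably large components; tiny ones are almost always debris.
keep = [p for p in parts if len(p.faces) > 500]
cleaned = trimesh.util.concatenate(keep)

print(f"{len(parts)} components found, {len(keep)} kept")
cleaned.export("cleaned_scan.stl")
```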
I received a completed model a few weeks later. It did not have defects, but it was incredibly detailed and would need manual work before it could be run through mesh2hrtf. The starting mesh was 400k triangles.
When I started I had no experience with 3D editing software. I used a combination of Meshmixer and Blender. They are without question hard to use. Thankfully Meshmixer has a more or less intuitive adaptive smoothing function and a strong set of analytical tools for finding defects or visually highlighting unusually oriented vertices. With some work I reduced the fine features of my face and torso, including rounding the flat surface sealing my truncated shoulders and chest, since the HRTF simulation requires the model to be watertight (also called manifold).
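As a rough sketch of what that watertightness requirement means in practice (trimesh is used purely for illustration here, it is not part of the mesh2hrtf toolchain, and the file names are placeholders), you can check a model and try patching small holes before attempting a simulation:

```python
# Quick sanity check before handing a model to mesh2hrtf: the mesh must be
# watertight (manifold). File names are placeholders.
import trimesh

mesh = trimesh.load("head_and_torso.stl")

print("watertight:", mesh.is_watertight)
print("triangles: ", len(mesh.faces))

if not mesh.is_watertight:
    # Attempt an automatic repair of small holes; large openings (like the
    # truncated shoulders) still need manual closing in Meshmixer/Blender.
    trimesh.repair.fill_holes(mesh)
    trimesh.repair.fix_normals(mesh)
    print("after repair, watertight:", mesh.is_watertight)

mesh.export("head_and_torso_checked.stl")
```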
Depending on the editing software, repairing holes and smoothing can produce insane geometries. Blender generally defaults to filling a selection with complex polygonal faces (n-gons) that best fit it; these then have to be subdivided into triangles. Meshmixer doesn't have this issue and defaults to triangles, but you have less fine control, and where the mesh has sharp discontinuities or edges the results are not predictable.
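For what it's worth, the hole filling and the triangulation can be done in one go from Blender's scripting tab. This is just a minimal sketch of the same operations that live in the Edit Mode menus, assuming the model is the active object:

```python
# Minimal Blender sketch (run in the Scripting workspace with the model
# selected): fill remaining holes, then convert n-gons/quads to triangles
# so the mesh stays all-triangle for mesh2hrtf.
import bpy

bpy.ops.object.mode_set(mode='EDIT')
bpy.ops.mesh.select_all(action='SELECT')

# sides=0 fills holes with any number of boundary edges.
bpy.ops.mesh.fill_holes(sides=0)

# Triangulate whatever faces the fill produced.
bpy.ops.mesh.quads_convert_to_tris(quad_method='BEAUTY', ngon_method='BEAUTY')

bpy.ops.object.mode_set(mode='OBJECT')
```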
My first attempt to run the HRTF calculation script failed because I did not sufficiently simplify the model. At 200k triangles it refused to run.
I reduced the triangle count to around 100k after a few days of working in Meshmixer. The ears I barely touched apart from smoothing the artificially sculpted blocked canal, since the scanner wasn't able to capture much deeper than the concha (the grooved area leading to the canal, where IEMs usually sit). I then ran the mesh2hrtf optimization script, which produced individual left and right models of around 40k triangles each.
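If you would rather not eyeball the triangle budget by hand, the reduction step can also be sketched as a single quadric decimation call. Open3D here is only a stand-in for Meshmixer's reduce tool and the mesh2hrtf grading script (file names and the target count are placeholders), and note that a blind global decimation like this does not protect the ear detail the way the grading script does:

```python
# For illustration: hitting a target triangle budget with quadric decimation.
# This simplifies everything uniformly, ears included, so it is only a
# starting point, not a replacement for the mesh2hrtf grading script.
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("head_and_torso.ply")
mesh.remove_duplicated_vertices()
mesh.remove_degenerate_triangles()

simplified = mesh.simplify_quadric_decimation(target_number_of_triangles=100_000)
print(len(simplified.triangles), "triangles after decimation")

o3d.io.write_triangle_mesh("head_and_torso_100k.ply", simplified)
```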
I tried calculating again but ran into a major issue. For the highest frequencies, a very large amount of RAM is required, more than the 32 GB I had installed. I am decent with computers, but I don't know how to troubleshoot situations like this. The script would not use the virtual memory (page file) I had allocated on my SSD.
I took a break from messing with the model and found a local audiologist who offered ear scans using the Otoscan system instead of the usual injection molds for custom earplugs and hearing aids. I convinced her to make the scans and send me the files. These were high-resolution scans of my canals to a depth of 20mm, which is fine; normal adult ear canals are 25-30mm long. The scans cost around $70 USD.
I asked the engineering firm for help splicing these into my earlier head and torso scan and they agreed. Here I should add that I suspected everyone who had reported poor virtualization results either used too low a scanning resolution or, because the scans had no ear canals, had the misfortune of needing more of the personalized high-frequency information necessary for externalization (hearing sounds whose apparent position is outside of your head) and especially frontal localization. Others, for unclear reasons, had no problems with frontal localization. There is evidently some level of luck involved.
I also eventually purchased and installed 64 GB of RAM. My computer is a Windows 11 passively-cooled machine hiding in a low cabinet in my living room. A small, silent Noctua fan forces air out of a slotted opening on its rear side. The case and power supply are made by Streacom, the motherboard is an ASRock Z690M-ITX/ax, and the CPU is an Intel i3-13100T with four cores. Without modern thermal throttling I'm sure the whole thing would have melted. Between 3 and 8 calculation instances were running at a time, consuming around 90% of RAM and CPU on average.
The calculation took around 52 hours to complete, but failed to produce usable results in the end due to what are called nonconvergence issues, where the calculation is unable to find a solution for the HRTF at a particular frequency.
These issues are caused by problems with the model geometry. To prevent them, mesh2hrtf has a mesh optimization script, but this script was clearly not meant for models with ear canals included. Normally the chosen ear is left more or less untouched while the mesh of the rest of the head is simplified, especially the opposite-side ear, which disappears completely apart from a few bumpy artefacts. In this case, the script generated a twisted version of the opposite-side ear canal. I think it tried to reduce it to nothing and close the hole, but couldn't because the geometry reached deeper than it expected. Manual work with Blender helped here.
I'm not completely sure what to do next. I think the nonconvergence issues, which started only at 15kHz and up, have to do with the geometry of the ear canals specifically. The furthest extremities are a little jagged, and I left them that way because working with geometry positioned inside a larger model is a struggle. I also didn't want to arbitrarily introduce new shapes, but that was probably the wrong weighing of priorities.
While Meshmixer has the better smoothing function, it does not react well to deeply concave surfaces. It will take more manual effort with Blender for a good result.
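To give an idea of the kind of targeted Blender smoothing I mean (this is only a sketch, and it assumes you have already hand-selected the jagged canal-tip vertices in Edit Mode), a few gentle passes of vertex smoothing relax the selection without inventing new shapes wholesale:

```python
# Blender sketch: smooth only the currently selected vertices (e.g. the
# jagged ends of the ear canal). Smoothing the whole mesh instead would
# wipe out the pinna detail the simulation depends on.
import bpy

bpy.ops.object.mode_set(mode='EDIT')

# Gentle factor, several repeats, applied to the existing selection only.
bpy.ops.mesh.vertices_smooth(factor=0.3, repeat=5)

bpy.ops.object.mode_set(mode='OBJECT')
```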
Whenever I finally get all this working, I will compare the tradeoffs of head-only vs. head-and-torso simulations, head tracking vs. none, and visual input vs. none. The first is a matter of accuracy, while the others apparently strongly determine the believability of auditory scenes. If you know the sound will not match the space you're in, that cognitive dissonance is supposedly enough to ruin auditory processing and perceptual stability.
Edit: Typo.