- How does BACCH enhance the spatial realism in the reproduction of acoustical recordings made in real acoustical environments?
Crosstalk cancellation (XTC) techniques, such as BACCH, suppress the sound recorded on the left (right) channel of a stereo recording at the right (left) ear of the listener during stereo playback from a pair of loudspeakers. This cancellation raises the maximum interaural level difference (ILD) and interaural time difference (ITD) that can be reproduced at the ears above what the speakers can deliver without XTC, allowing more of the correct spatial cues of the recorded sources to be reproduced at the ears of the listener. (The ILD is the sound pressure level (in dB) caused by a given source at one ear minus the level at the other ear. The ITD is the difference between the sound arrival times at the two ears. Both are generally frequency-dependent functions.)
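To make the two definitions concrete, here is a minimal Python sketch. The function names and the sample numbers are ours and purely illustrative (they are not BACCH-dSP functions or measurements); the sketch assumes RMS pressures at each ear and arrival times in seconds, and a sign convention in which a source on the left gives positive values.

```python
import math

def ild_db(p_left_rms, p_right_rms):
    """ILD in dB: level at the left ear minus level at the right ear."""
    return 20.0 * math.log10(p_left_rms / p_right_rms)

def itd_us(t_left_s, t_right_s):
    """ITD in microseconds: positive when the sound reaches the left ear first."""
    return (t_right_s - t_left_s) * 1e6

# Hypothetical values for a source on the listener's left:
# the left-ear pressure is about 2.5 times the right-ear pressure,
# and the sound reaches the right ear 0.4 ms after the left ear.
print("ILD (dB):", round(ild_db(2.512, 1.0), 1))
print("ITD (microseconds):", round(itd_us(0.010000, 0.010400)))
```

In practice both quantities are evaluated per frequency band, which is why they are described above as frequency-dependent functions.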
This is best illustrated by first considering the case of ILD (the case of ITD will be discussed subsequently) and acoustical stereo mic recordings in a real space.
Most, if not all, of the statements below can be verified by the experimentally-minded reader using the BACCH-BM microphone and the recording and extensive measurement capabilities of the BACCH-dSP application at the heart of Theoretica’s BACCH4Mac packages.
It is insightful to first consider the general case of a binaural dummy-head recording, after which the more particular case of regular stereo miking techniques becomes easier to understand. The latter are of two general types: Type A stereo miking techniques, which rely on ILD to code the stereo image (e.g. ORTF, XY, and other coincident mic techniques), and Type B stereo miking techniques, which rely on ITD to code the stereo image (e.g. spaced omni or A-B mics, Decca tree, Jecklin disk, etc.).
Binaural Recordings:
The general case of binaural dummy head recording is the most natural (i.e. most akin to how humans hear) as it captures both the ILD and ITD cues, as well as the so-called spectral cues, which are associated with the non-flat frequency response imposed by the diffraction of the sound waves around the torso, head, and pinnae of the dummy (or human) head wearing the in-ear microphones. This individualized frequency response, which helps the ear-brain system locate sound sources according to the tonal coloration the listener’s particular ear-brain system expects, becomes flatter as the frequency decreases, because the wavelength becomes larger than the objects (the torso, head, and pinnae) the sound is diffracting around. These “spectral cues” [see Footnote 13] are used by the human ear-brain system, in addition to the ILD and ITD cues, to locate sound sources.
Let us consider the case of recording a performer on a stage in a real hall. Using a dummy-head binaural microphone (or a human wearing in-ear mics) we would capture all three types of cues (the ILD, ITD and spectral cues) on the two channels of the stereo recording. Say the performer is located at an azimuthal angle of 50 degrees to the left of the dummy head. If one measures the ILD caused, at the dummy’s ears, by a sound source at that location (such a calculation consists of subtracting the SPL measured at the right ear from that at the left ear) one would find, on average, about 8 dB (strictly speaking this depends on the distance and frequency, which, for the sake of illustration, we take to be about 10 feet and 1 kHz, respectively). If the performer, while performing (e.g. clapping), moves to the center position facing the dummy head, the ILD would drop to 0. If she moves further to the left, the ILD increases and can easily exceed 8 dB. If she approaches the recording head from the left, the ILD would build up further (due to the enhanced effect of the head shadowing the right ear) and can reach as high as 20 dB if the performer gets very close to the left ear (since most of the sound will be blocked from reaching the right ear). As a thought experiment, let us record, using the dummy head mic, the performer as she moves (while performing) from the center position to the left position (50 degrees), and then walks to the recording head and whispers in its left ear.
For a stereo playback system to be able to reproduce this entire spatial image accurately from the above-described recording, it must reproduce this entire range of ILD, from 0 to 20 dB at the ears of the listener. We shall now explain why a regular stereo system cannot do so without XTC.
The problem with a “regular” stereo playback system (as opposed to one with XTC) is that the maximum ILD it can deliver is that produced by the left (or right) speaker alone, which for a regular (equilateral) stereo triangle is only about 3-5 dB at 1 kHz (depending on the radiation pattern of the speaker and the distance of the listener from the speakers). This number can be easily verified by putting a test signal (1 kHz sine wave, or pink noise) in the left channel (and only the left channel), measuring the SPL at the left ear and subtracting from it the SPL measured at the right ear (which can be easily done using a sine sweep in BACCH-dSP to produce a plot of the ILD spectrum over the entire audio band). The plot below shows such a typical measured ILD spectrum made with BACCH-dSP through a typical stereo system in the “regular stereo triangle” (+/- 30 degrees) configuration.
The black (red) curve represents the measured ILD spectrum of the left (right) speaker at the ears of the listener. Note that at 1 kHz, the ILD is about 5 dB. At higher frequencies, head shadowing (which acts as a “natural XTC”) causes the ILD to rise a bit (as clearly seen in the plot), but the most important content perceptually (especially human voices) is below 1 kHz. Therefore, a listener listening to the recording we made above would hear the performer move from the center towards the left speaker and then get “stuck” at the left speaker as the ILD in the recording builds up above 5 dB, since the reproduced ILD at the listener’s ears cannot exceed 5 dB. This should illustrate clearly the fundamental flaw in speakers-based spatial audio reproduction without XTC. [See Footnote 14 for an additional, more subtle, flaw.]
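The ceiling described above can be illustrated with a deliberately simplified model. The Python sketch below is an assumption-laden illustration, not the BACCH measurement or algorithm: it treats crosstalk as a single frequency-independent attenuation (`xtc_db`, a name we introduce here), and it assumes the contributions of the two speakers add as uncorrelated powers at each ear. Under those assumptions, about 5 dB of crosstalk attenuation corresponds to playback without XTC at 1 kHz, and 20 dB to playback with a high level of XTC.

```python
import math

def reproduced_ild_db(recorded_ild_db, xtc_db):
    """ILD at the listener's ears for a given ILD in the recording.

    xtc_db is the attenuation of each speaker's sound at the far ear.
    Simplifying assumption: speaker contributions add as powers at each ear.
    """
    left_power = 10.0 ** (recorded_ild_db / 10.0)  # left-channel power
    right_power = 1.0                              # right-channel power (reference)
    leak = 10.0 ** (-xtc_db / 10.0)                # power fraction reaching the far ear
    at_left_ear = left_power + leak * right_power
    at_right_ear = right_power + leak * left_power
    return 10.0 * math.log10(at_left_ear / at_right_ear)

# Recorded ILD of the walking performer versus what reaches the ears.
for recorded in (0, 4, 8, 12, 20):
    without_xtc = reproduced_ild_db(recorded, 5.0)
    with_xtc = reproduced_ild_db(recorded, 20.0)
    print(recorded, round(without_xtc, 1), round(with_xtc, 1))
```

In this model the reproduced ILD never exceeds the crosstalk attenuation itself, however large the recorded ILD becomes, which is the "stuck at the left speaker" behavior; raising the attenuation raises the ceiling.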
XTC can remove this limitation. In particular, BACCH can deliver the maximum possible level of XTC (with zero added tonal coloration) for a given pair of speakers in a given room, based on a measurement of the two-point HRTF [see Footnote 15] of the listener with the calibrated BACCH-BM microphone. The resulting ILD spectrum (which, by definition, is the same as the XTC spectrum) is shown in the figure below for the same audio system:
It should be clear from this plot that BACCH can deliver, for the same audio system, 15 dB ILD at 1 kHz, with ILD levels well exceeding 20 dB, at the ears of the listener sitting in the sweet spot (the location where the HRTF measurement was made.) Therefore, the performer would now be perceived to walk all the way from the center, way past the left speaker, to an azimuthal angle of 50 degrees, then walk towards the listener and whisper in his left ear, much like in the real life event. This is the case irrespective of the location of the speakers, as long as the BACCH filter used during playback was designed for that particular speakers-listener configuration. (Incidentally, BACCH-dSP has a simple easy-to-use binaural recorder that allows you to verify the above by quickly making such a recording of a performer walking around you with the BACCH-BM in your ears, then immediately listen to it through a BACCH filter.)
Now that, we hope, this is all clear for a dummy-head recording, it is easy to explain how a similar enhancement to the accuracy of spatial reproduction can be attained for a recording done with a regular stereo mic pickup.
Type A Recordings:
Stereo recordings done with a "Type A microphone" (ORTF, XY, coincident mic techniques) rely on mic capsules with directional pickup patterns (cardioid, hypercardioid, etc.) oriented in such a way as to proportionally attenuate the sound of a source located on the right (left) side of the microphone as it reaches the left (right) capsule. Therefore, such a microphone is mostly capturing the “ILD” (and in the case of a coincident stereo microphone, only the ILD). Although this “ILD” may be a bit different from the actual ILD a dummy head would capture (since the attenuation imposed by the highly directive capsules may not accurately represent the attenuation due to head shadowing), it is fully capable of capturing a good part, if not all, of the wide ILD range (0-20 dB) of our proverbial walking performer. Again, a stereo system without XTC will only be able to reproduce a small part of that range (up to about 5 dB), and again the performer will be stuck at the left speaker as soon as she reaches about 30 degrees azimuth to the left, and will remain there throughout the rest of the recording, while in real life she was walking well past that angle (to 50 degrees) and then towards the left side of the microphones. Again, the same stereo system with the BACCH filter whose measured XTC performance is shown in the plot above can reproduce virtually the entire range of ILD, and thus can give the listener a far more accurate reproduction of the full spatial image.
The difference between a binaural recording done with a dummy head and a stereo recording done with a Type A stereo microphone, when both are rendered through the same BACCH filter whose XTC performance is shown in the plot above, is that the one-to-one spatial correspondence between the real image and the perceived image is more accurate for the former (since the ILD is coded with the attenuation due to human head shadowing) than for the latter (since the ILD is coded with the particular attenuation due to the directivity pattern of the capsules in the Type A stereo mic). However, they both give a spatial image (through the same BACCH filter) that is far more accurate and realistic than that of playback without XTC.
Type B Recordings:
Since "Type B" stereo recording techniques (e.g. spaced omnis) use omni-directional microphones, they rely on spacing the two microphone capsules some distance apart to pick up ITD cues (the captured ILD cues being negligible [see Footnote 16]). At first look one might (wrongly) suspect that stereo recordings done with such a stereo microphone might not benefit from XTC during playback as much as Type A or binaural recordings, since XTC only affects the level of the sound pressure at the ears. But in fact, the delay between the arrival times of a source’s sound at the left and right capsules of the microphone will not be reproduced correctly at the ears of the listener if crosstalk is present, as explained in the next (long) paragraph.
To understand why this is the case, consider again the performer moving from the center position, where the ITD is 0, to 50 degrees azimuth left while clapping her hands. A typical ITD for a source there would be something like 400 microseconds. Now if that recording of the performer clapping at 50 degrees azimuth is played back through a pair of stereo speakers, the level of the clap sound is the same on both channels (because there is little if any ILD captured by the Type B stereo microphone) but the clap on the right channel is delayed by 400 microseconds with respect to the left channel of the recording. Therefore, the sound of the clap will arrive at the left ear from the left speaker first, then, after a delay of t1 microseconds, that same sound wave will reach the right ear. (Here t1 is the ITD that would be caused at the ears of the listener by a source located where the left speaker is, i.e., at 30 degrees azimuth. It should be clear that t1 is significantly less than the 400-microsecond ITD of a sound source at 50 degrees.) If, hypothetically, there were no sound from the right speaker, the listener would hear the clap coming from the location of the left speaker (which, at 30 degrees azimuth, is not the correct 50-degree azimuthal location of the real-life clap). However, the right speaker will emit the clap recorded on the right channel 400 microseconds after it was first emitted by the left speaker. This sound will reach the right ear first and then the left ear t1 microseconds later, causing an ITD of t1 that is wrong in value and also on the wrong side of the listener! (Again, if, hypothetically, there were no sound from the left speaker, the listener would hear the clap coming from the location of the right speaker.)
However, due to the Haas precedence effect, the two sounds (emitted from the left and right speakers) are perceived as fused into one, and the ITD caused by the first one (from the left speaker) dominates perceptually, as it arrived first, causing the listener to perceive the sound of the performer clapping to be essentially located at the left speaker, which is at 30 degrees, and not at the correct 50 degrees we seek [see Footnote 17 for a more accurate description of the net effect of the “fusing” of these two sounds].
In contrast, if the crosstalk is cancelled, the left ear (and only that ear) would hear the clap emitted from the left speaker, then the right ear (and only that ear) would hear the clap from the right speaker delayed by 400 microseconds resulting in the correct ITD at the ears, and thus allowing the listener to perceive the correct real-life location of the performer, irrespective of the location of the speakers (again, assuming that the BACCH filter corresponding to that speakers-listener configuration is used).
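The timing argument of the last three paragraphs can be traced numerically. The Python sketch below is illustrative only: it estimates t1 with Woodworth's classical spherical-head approximation, ITD = (a/c)(θ + sin θ), using an assumed head radius of 0.0875 m, and it idealizes XTC as complete removal of the crosstalk paths. All function names are ours. With these assumptions the model gives roughly 260 microseconds at 30 degrees and about 420 microseconds at 50 degrees, in line with the "something like 400 microseconds" used above.

```python
import math

HEAD_RADIUS_M = 0.0875   # assumed average head radius (illustrative)
SPEED_OF_SOUND = 343.0   # m/s in air at room temperature

def woodworth_itd_us(azimuth_deg):
    """Approximate ITD (microseconds) of a distant source, spherical-head model."""
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND * (theta + math.sin(theta)) * 1e6

def first_arrival_itd_us(recorded_itd_us, speaker_itd_us, xtc):
    """ITD between the first arrivals at the two ears (left ear leads if positive).

    Times are relative to the left speaker's sound reaching the left ear.
    With xtc=True the crosstalk paths are assumed to be fully cancelled.
    """
    left_ear = [0.0]                # left speaker -> left ear
    right_ear = [recorded_itd_us]   # right speaker -> right ear (delayed channel)
    if not xtc:
        right_ear.append(speaker_itd_us)                   # left speaker -> right ear
        left_ear.append(recorded_itd_us + speaker_itd_us)  # right speaker -> left ear
    return min(right_ear) - min(left_ear)

t1 = woodworth_itd_us(30.0)  # ITD of a source at the left speaker position
print("t1 (microseconds):", round(t1))
print("first-arrival ITD without XTC:", round(first_arrival_itd_us(400.0, t1, xtc=False)))
print("first-arrival ITD with XTC:", round(first_arrival_itd_us(400.0, t1, xtc=True)))
```

Without XTC the first-arriving pair of wavefronts carries the ITD of the left speaker (t1); with the crosstalk paths removed, the recorded 400 microseconds is what reaches the ears.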
You can easily verify the above claim that XTC improves the spatial accuracy of Type B recordings using BACCH-dSP. First, make a recording of someone walking around you while speaking or clapping, with the BACCH-BM microphones in your ears. This first recording is the reference binaural recording. Then make a second recording of the same performance, but this time hold one capsule in each hand, with the two capsules spaced about 6 inches apart. Since the BACCH-BM capsules are essentially omnidirectional, this is tantamount to a "Type B recording" (spaced omnis). After the recordings are done, play the reference binaural recording while toggling the BACCH filter on and off (which in BACCH-dSP can easily be done with a click of the mouse) and observe how the spatial accuracy is greatly improved when the BACCH filter is on. Finally, play the Type B recording while toggling the BACCH filter on and off, and you will also hear a significant enhancement in the spatial accuracy when the BACCH filter is on, as discussed above.
In conclusion, XTC greatly benefits the spatial accuracy not only of the speakers-based playback of binaural recordings, but also of that of Type A and Type B recordings (and therefore of virtually all well-made stereo acoustical recordings in real acoustical spaces), as it allows both the ILD and ITD cues to be reproduced more correctly at the ears. If XTC worked only for binaural recordings, as some people who have not carefully listened to proper XTC have wrongly surmised, no one would be interested in BACCH, as binaural recordings are a minuscule fraction of available commercial recordings.
There remains the important question of whether XTC can benefit the spatial rendering of recordings that are produced “artificially” by mixing audio stems (which is the case for the vast majority of popular music). This question is addressed in the following section.
- How does BACCH enhance the spatial imaging of "studio-mixed" recordings without altering the sound intended by the mixing engineer?
In light of the arguments in FAQ #14 above, we can now address the case of “studio-mixed” recordings, which represent the vast majority of commercially available recordings. In such recordings, the mixing engineer (sometimes with input from the artist(s) and/or producer(s) and, to a lesser extent, the mastering engineer) concocts an artificial stereo image from stems (most often mono stems), mostly through level panning (and, much less often, time or phase panning) between the left and right channels. Mixing to produce a realistic, pleasing or engaging stereo image is an art involving both technical know-how and esthetic decisions.
Many mixing engineers are truly ingenious masters. It goes without saying that their final product deserves the utmost respect and that a good hi-fi reproduction system should not degrade or fundamentally alter their construct. It is also very true that virtually all commercially available mixed recordings were mixed while monitoring on monitors without XTC.
Depending on the techniques used and esthetic decisions made, these concocted recordings range over a wide spectrum: on one end of the spectrum are recordings aiming to emulate a real acoustic environment (e.g. a jazz club). Let us call this end of the spectrum the “pseudo-realistic end”. On the other end of the spectrum are recordings that have no binding ties to realism, and instead aim to evoke sensations, or project certain esthetic expressions (e.g. the chimes in Pink Floyd’s well-known Time track on their Dark Side of the Moon album). Let us refer to this end of the spectrum as the “artificial end”.
We will now consider what happens when such recordings are played back through XTC.
On the pseudo-realistic end of that spectrum, most of the arguments made in FAQ #14 above hold, to some extent, since the mixing engineer is essentially using at least an analog of ILD and ITD to produce a “realistic” stereo image like a stereo mic would, and all that XTC does is remove the artificial ceiling on the ILD and ITD imposed by the speakers during playback. Most relevant in this context is reverb. During mixing, reverb is added algorithmically or through convolution with a real-space impulse response (with the latter technique yielding far more realistic reverb). In both cases XTC unlocks the perceived reverberation from the speakers and projects it into 3D space. It does so because the perception of a realistic 3D reverb is caused by late reflections (the diffuse field) arriving at the left and right ears at almost random times (i.e. with low L-R correlation, in the parlance of acoustics), and without XTC the sound at the right and left ears would be highly correlated, since the sound from each of the L or R channels reaches both ears. Such highly L-R correlated sound causes the listener to perceive the reverb to be largely restricted spatially to a region that is mostly where the speakers are. It is hard to imagine a mixing engineer who would object to his mix being reproduced with a reverb that is more 3D and less “stuck to the speakers”, as long as the tonal and level balance between the direct and reverberant sound is not altered. (BACCH is a patented form of advanced XTC that causes no alteration whatsoever to that balance, as described in this standard, but highly technical book chapter.) In fact, one of the most noticeable and striking aspects of listening through a BACCH filter for the first time is the immediate sense of being in a real 3D space due to the higher L-R sound decorrelation that reverb is meant to cause at the ears.
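The effect of crosstalk on L-R correlation can be seen in a toy simulation. The Python sketch below is a rough illustration under strong assumptions, not a model of BACCH: the reverb tails on the two channels are replaced by independent Gaussian noise, crosstalk is a single frequency-independent gain with no delay or head filtering, and the function names are ours. For this simple mixing model the expected correlation at the ears is 2g/(1 + g²), where g is the linear crosstalk gain.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ear_correlation(xtc_db, n=20000, seed=1):
    """Correlation between the ear signals for uncorrelated L and R channels.

    xtc_db is the attenuation of each channel at the far ear (assumed flat).
    """
    rng = random.Random(seed)
    left_ch = [rng.gauss(0.0, 1.0) for _ in range(n)]
    right_ch = [rng.gauss(0.0, 1.0) for _ in range(n)]
    g = 10.0 ** (-xtc_db / 20.0)  # linear crosstalk gain
    left_ear = [l + g * r for l, r in zip(left_ch, right_ch)]
    right_ear = [r + g * l for l, r in zip(left_ch, right_ch)]
    return correlation(left_ear, right_ear)

for xtc_db in (5.0, 10.0, 20.0):
    print(xtc_db, round(ear_correlation(xtc_db), 2))
```

With only about 5 dB of attenuation (typical of playback without XTC near 1 kHz, per the measurement discussed in FAQ #14) the ear signals in this toy model are highly correlated even though the channels are not; at 20 dB the correlation falls to a small fraction of that.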
On the “artificial end” of the studio-mixed recordings spectrum defined above, the mixing engineer concocts an image whose panned sources constitute an artificial stereo image that does not aim to be a reflection of reality, but rather an esthetic or artistic construct. While mixing that image the engineer is choosing to place sources in a space that is largely between the two speakers. However, as is well known by audiophiles, even a stereo playback system without XTC can image in a 3D, albeit relatively restricted, spatial region around the speakers (often called “the soundstage”). The main reason such imaging occurs without active XTC is that the listener’s head, by shadowing the contralateral ear (i.e. the ear on the side opposite the loudspeaker), creates a natural crosstalk cancellation that is highly effective at higher frequencies (i.e. frequencies whose wavelengths are smaller than the size of the human head). It should be clear that this natural XTC (which can be seen in the measurement shown in the first plot in FAQ #14) depends on the span between the speakers, the distance between the head and the speakers, the radiation pattern of the speakers, and the extent and relative strength of reflections in the room. A larger speaker span, a shorter distance to the head, a more directive speaker, and a higher ratio of direct-to-reflected sound all lead to higher values of this natural XTC. This is mainly why different stereo systems in different rooms, with different listener-speakers placements, can achieve different levels of “3D imaging”.
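As a rough check of where head shadowing starts to matter, the short Python sketch below compares the wavelength of sound with the size of the head. The head diameter of 0.175 m is an assumed round figure, and the "crossover" it yields is only an order-of-magnitude guide, not a sharp threshold.

```python
SPEED_OF_SOUND = 343.0    # m/s in air at room temperature
HEAD_DIAMETER_M = 0.175   # assumed typical head diameter (illustrative)

def wavelength_m(frequency_hz):
    """Wavelength of sound in air at the given frequency."""
    return SPEED_OF_SOUND / frequency_hz

# Frequency at which the wavelength equals the assumed head diameter.
crossover_hz = SPEED_OF_SOUND / HEAD_DIAMETER_M
print("wavelength equals head diameter near", round(crossover_hz), "Hz")

for f in (100, 500, 1000, 2000, 5000, 10000):
    ratio = wavelength_m(f) / HEAD_DIAMETER_M
    print(f, "Hz:", round(wavelength_m(f), 3), "m,", round(ratio, 2), "head diameters")
```

Below roughly 2 kHz the wavelength is larger than the head, so the head casts little acoustic shadow, which is consistent with natural XTC being weak over the range where much of the important musical content lies.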
A mixing engineer in a given studio with a certain set of stereo speakers concocts a stereo image while hearing a soundstage whose spatial extent depends largely on the above-listed parameters of the particular monitoring setup in the studio. An audiophile playing back the resulting recording through a good hi-fi stereo system at home generally has no way of knowing what these parameters were when the mix was produced, but still strives to get a good measure of a 3D soundstage. Indeed, “3D soundstage” imaging is one of the holy grails for audiophiles and audio critics. By choosing and tuning his gear and listening room to enhance such a soundstage, the audiophile does not betray the intent of the mixing engineer, as long as the enhancement of the spatial extent of the soundstage does not come at the expense of a change in the spatial balance or tonal content of the recording during playback. It is very possible that an audiophile’s playback system has significantly better 3D imaging capability than the system used by the engineer while monitoring the mix. No one would object if this were the case, or accuse the audiophile of betraying the engineer's intent.
For such recordings (on the “artificial end” of the spectrum), XTC cannot pretend to enhance realism during playback, since the stereo image was artificially concocted in the first place. However, as in the case of natural XTC, adding more XTC actively to enhance the spatial extent of the soundstage, without altering the balance or tonal content of the recording (which is the essential characteristic of BACCH XTC), does not strictly betray the intent of the mixing engineer, since the spatial extent of the artificial soundstage was not prescribed by him. Of course, this argument becomes more tenuous if XTC leads to extreme spatial panning, which can only happen for hard left or right panned sources in the absence of reflections (e.g. in an anechoic chamber, a hard left or right panned sound source played back through a pair of speakers with high levels of XTC, without any ILD or spectral cues added to the sound, would be perceived to be very close to the left or right ear of the listener, as if the listener were wearing headphones). Such extreme imaging does not occur in real listening rooms with typical levels of direct-to-reflected sound ratio.
Of course, the level of active XTC during playback can be dialed down (in BACCH-dSP there is an “XTC percentage” slider that allows doing just that), but it should be clear from the above arguments that this is not recommended for acoustic recordings or for recordings on the “pseudo-realistic end” of the “studio-mixed” recordings spectrum. Moving towards the “artificial end” of the spectrum, the question of betraying the original intent of the engineer does indeed become a valid objection, but only to the extent to which XTC alters the tonal character and spatial balance of the recording (which BACCH, by design, does not do at all) and to the extent to which high levels of XTC can result in jarring, extremely panned images, which can occur with BACCH but only in near-anechoic environments and with recordings having extremely panned mono images. The latter issue can be addressed by dialing back the XTC level (or, in extreme but very rare cases, by bypassing XTC!).