
How well does high-fidelity audio align with human perception?

tengiz

Coming from a background in physical acoustics and engineering - though not specifically in audio - I’ve been wondering why high-fidelity audio still seems so focused on signal accuracy - "waveform fidelity," for lack of a better term. By signal, I mean what microphones capture and what gets delivered near the ears.

To be fair, modern high-fidelity systems already account for a lot of the basics: that we hear with two ears, the limits of human hearing in frequency and dynamic range, how we perceive loudness, acceptable levels of various distortions, and so on. Perceptual insights have clearly shaped things like lossy compression, spatial audio, and room acoustics. But beyond that, I haven’t seen much that really engages with the more subtle ways we actually perceive sound.

Take harmonics, for example. Real-world sounds aren’t just single tones - harmonics are a universal property of oscillations, and any hearing system that evolved to make sense of sound likely developed around this structure. From very early on, brains - ours and those of many other animals - are wired to hear harmonically related frequencies as one sound, or as belonging together. It plays a huge role in how we recognize voices, distinguish instruments, and perceive musical harmony. So I can’t help but wonder - could that be used more directly in how we design audio systems?

I’d be curious to learn if there are any interesting efforts along those lines, or what the main challenges are.

Thanks.
 
Take harmonics, for example. Real-world sounds aren’t just single tones - harmonics are a universal property of oscillations, and any hearing system that evolved to make sense of sound likely developed around this structure. From very early on, brains - ours and those of many other animals - are wired to hear harmonically related frequencies as one sound, or as belonging together. It plays a huge role in how we recognize voices, distinguish instruments, and perceive musical harmony. So I can’t help but wonder - could that be used more directly in how we design audio systems?
That would be relevant in the creation of the content. The playback chain should be concerned with reproducing the content as accurately as possible. You also have to take into account the acoustics of the listening space and the properties of drivers and their enclosures, but those are, of course, areas that have been and continue to be studied quite a bit.
 
Take harmonics, for example. Real-world sounds aren’t just single tones - harmonics are a universal property of oscillations,
A microphone picks up the harmonics and all of the other simultaneous sounds (just like your ears).

The stored audio contains all of the information, and a speaker can reproduce all of the frequency components simultaneously too. Except, it's "difficult" for a single driver to reproduce the whole range, so we often have 2-way speakers with a woofer & tweeter or a 3-way speaker with a woofer, midrange, and tweeter. All of the waves come out of the drivers and superimpose to re-create the complete complex sound wave.

Obviously, we can hear more than one instrument in a recording and you can hear the harmonics & overtones that make a piano sound different from a saxophone, and that make different singers sound different when singing the exact same notes. ;)

There is a concept that you probably studied somewhere along the way called superposition. (I'm not sure if that Wikipedia article explains it that well... It's pretty simple, really.)
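
To make superposition concrete, here is a minimal sketch (plain NumPy, hypothetical numbers): the sum of a fundamental and a few harmonics is the same whether the components come out of one driver or are split between a "woofer" and a "tweeter" and recombined in the air.

```python
import numpy as np

fs = 48000                          # sample rate in Hz (assumed)
t = np.arange(fs) / fs              # one second of samples
f0 = 220.0                          # hypothetical fundamental (A3)

# Build a complex tone as a sum of its first four harmonics.
amplitudes = [1.0, 0.5, 0.25, 0.125]
partials = [a * np.sin(2 * np.pi * f0 * (k + 1) * t)
            for k, a in enumerate(amplitudes)]
complex_tone = np.sum(partials, axis=0)

# "Two-way speaker": low partials to the woofer, the rest to the tweeter,
# then let the two outputs superimpose in the air.
woofer = np.sum(partials[:2], axis=0)
tweeter = np.sum(partials[2:], axis=0)
recombined = woofer + tweeter

# Superposition: the recombined wave equals the original complex tone.
print(np.allclose(complex_tone, recombined))    # True
```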

If you make a good recording of a singer or an instrument, etc., and play it back on a good speaker in the same location in the room, you can get VERY realistic reproduction. It works even better in a "lively" acoustic environment where the room has a bigger influence on what you hear.

If you had an "unlimited" budget, room acoustics are generally the "last remaining problem." An orchestra playing on your stereo in your living room is never going to sound like an orchestra in a concert hall with all of the wonderful natural reverb coming from all directions.

Of course, most modern recordings are studio creations. There was no "live performance". Even if the band was playing together at once in the studio, it probably wouldn't sound that good in the room with the musicians. It probably sounds better listening to the monitors in the control room, but even then the mix isn't done yet.

Having only 2 speakers is a limitation too. You can get closer to the "big real room" sound with surround sound* but it's hard to get rid of the small-room acoustics. The stereo "phantom center" isn't perfect either, especially if you are sitting off-center. The center channel in surround systems takes care of that.


* With regular stereo recordings I like to use a "hall" or "theater" setting on my AVR to get the "feel" of a bigger room. ...That can be considered hi-fi heresy since I'm not listening accurately as intended or as it was heard in the studio but I like it!
 
Sounds like you're making a case for phase accurate speakers. There are a few companies that go to a lot of trouble to keep their speakers phase accurate, which means the harmonics will coincide with the fundamental in time the way they were recorded. Most speakers aren't phase accurate except at the crossover point. That means the harmonics may come early or late. Some claim this is inaudible to humans. The argument I've heard made for it is that our brains are good at reassembling what we hear into coherent sound, so we can't exactly hear the difference, but our brains don't have to work as hard to fix the signal if the timing is correct to begin with. This can, for some people, make phase correct speakers more relaxing. I lean towards thinking there's some truth to it.
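
To illustrate what "harmonics coming early or late" does to a signal, here is a small sketch (plain NumPy, made-up per-harmonic delays, not a model of any real crossover): a frequency-dependent time shift changes the waveform shape while leaving the magnitude spectrum untouched, which is exactly why its audibility is debated.

```python
import numpy as np

fs = 48000
t = np.arange(fs // 10) / fs              # 100 ms of samples
f0 = 200.0                                # hypothetical fundamental

# Square-ish wave built from its first few odd harmonics, all starting in phase.
harmonics = [1, 3, 5, 7, 9]
original = sum(np.sin(2 * np.pi * f0 * n * t) / n for n in harmonics)

# Same harmonics, but each delayed by a different (made-up) amount, roughly
# what a non-phase-accurate crossover does to components around its corner.
delays_ms = {1: 0.0, 3: 0.2, 5: 0.5, 7: 0.8, 9: 1.0}
shifted = sum(np.sin(2 * np.pi * f0 * n * (t - delays_ms[n] / 1000)) / n
              for n in harmonics)

# The magnitude spectra are essentially identical...
same_magnitudes = np.allclose(np.abs(np.fft.rfft(original)),
                              np.abs(np.fft.rfft(shifted)), atol=1e-6)
# ...but the time-domain waveforms are clearly different.
max_waveform_diff = np.max(np.abs(original - shifted))
print(same_magnitudes, max_waveform_diff)   # True, and a clearly nonzero number
```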
 
I’d be curious to learn if there are any interesting efforts along those lines, or what the main challenges are.

I think you need to spend some time here: http://www.davidgriesinger.com/

Recommend reading his articles and presentations while keeping in mind that whatever insights are found in terms of psychoacoustics of auditoria also apply to home hi-fi.

Chris
 
The general term for what you're describing is sound design. An engineer uses tools to develop a loudspeaker which meets certain criteria, while the designer decides what those criteria may be. Sound designers focus more on psychoacoustics than physics, but the two disciplines (design and engineering) are not mutually exclusive.


Guitar cabs serve as a good example of this blend. The harmonics and nonlinear distortion products of a cab are engineered to meet a specific design goal, and this goal is often subjective. You could say the same for nearly every (non-measurement) microphone.


But to more directly answer your question: yes, research continues to explore ways to better connect the source to the listener. Some examples you may have heard of include wave field synthesis, object-oriented audio, "temporal" room correction, harmonic enhancement, and so on. The current trend seems to focus on the spatial characteristics of audio.
 
could that be used more directly in how we design audio systems?
Short answer: perhaps if we focused more on reducing THD and IMD, but speakers that perform well with tones tend to reproduce harmonic series almost as well.
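
For reference, this is the kind of tone-based metric meant here: a minimal THD estimate from the FFT of a single test tone, with the distortion products synthesized purely for illustration.

```python
import numpy as np

fs = 48000
t = np.arange(fs) / fs                  # exactly one second: 1 Hz bin spacing
f0 = 1000                               # 1 kHz test tone

# Synthetic "driver output": fundamental plus small 2nd and 3rd harmonics.
signal = (np.sin(2 * np.pi * f0 * t)
          + 0.010 * np.sin(2 * np.pi * 2 * f0 * t)    # -40 dB 2nd harmonic
          + 0.003 * np.sin(2 * np.pi * 3 * f0 * t))   # ~-50 dB 3rd harmonic

spectrum = np.abs(np.fft.rfft(signal))
fundamental = spectrum[f0]                            # bin index == frequency here
harmonics = [spectrum[k * f0] for k in range(2, 6)]   # 2nd through 5th harmonics

thd = np.sqrt(sum(h ** 2 for h in harmonics)) / fundamental
print(f"THD = {100 * thd:.2f}%")                      # ~1.04%
```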
 
Take harmonics, for example. Real-world sounds aren’t just single tones - harmonics are a universal property of oscillations, and any hearing system that evolved to make sense of sound likely developed around this structure. From very early on, brains - ours and those of many other animals - are wired to hear harmonically related frequencies as one sound, or as belonging together. It plays a huge role in how we recognize voices, distinguish instruments, and perceive musical harmony. So I can’t help but wonder - could that be used more directly in how we design audio systems?

I recommend slide 17 ff., which explains why phase fidelity of loudspeakers and listening rooms is important.

More here on the effect of phase fidelity of loudspeakers and controlling early reflections. Note that most loudspeakers and listening rooms are not set up for listeners to hear these effects. It was quite startling when I flattened the phase response of the loudspeakers (all 5.1 channels) and then experienced the effect in a listening room treated for early reflections.

Refer to the cyan trace (early decay time, EDT) in the reverberation time (RT) plot below, which shows the effect of full-range controlled directivity down to the room's Schroeder frequency (~100 Hz) on the control of early reflections:

[Attached image: ChrisAListeningRoom-1MRight.jpg (RT/EDT plot)]


More effects of flattened phase loudspeaker/controlled early reflections from Linkwitz's site: https://www.linkwitzlab.com/Attributes_Of_Linear_Phase_Loudspeakers.pdf

Chris
 
Sounds like you're making a case for phase accurate speakers.

I recommend slide 17ff, that explains why phase fidelity of loudspeakers and listening rooms is important.

More here on the effect of phase fidelity of loudspeakers and controlling early reflections. Note that most loudspeakers and listening rooms are not set up for listeners to hear these effects.

Yes! Phase alignment is an excellent example - whether between speaker drivers or between fundamentals and their harmonics, it does seem to matter perceptually. And yet, in real rooms with real speakers, those phase relationships often get scrambled by reflections and other interactions. Still, the sound is usually perceived as coherent and stable.

If the waveform itself isn’t really preserved by the time it reaches the ears, then what is? What aspects of the sound are still getting through in a way that lets us recognize voices, follow melodies, or localize sources reliably? Are there certain patterns or relationships - timing, frequency content, something else - that our hearing is tuned to pick up even when the signal is heavily altered?

If so, could those features be identified and used more deliberately in system design? Not just focusing on reproducing the exact waveform, but on preserving the parts of the sound that perception actually depends on - the ones that tend to hold up better in real-world conditions.
 
A microphone picks-up the harmonics and all of the other simultaneous sounds (just like your ears).
...
Right - while agreeing with what you've said, what I meant about the harmonics was somewhat different: of course, if you preserve the waveform, you preserve the harmonics too.

But the ear and brain seem to have a pretty specific way of analyzing sound, shaped by millions of years of evolution. Our auditory system is tuned to pay close attention and pick up on particular relationships between frequencies. Otherwise the harmony/music would not be possible.
 
If the waveform itself isn’t really preserved by the time it reaches the ears, then what is?
In the examples I gave above, the "waveform" IS largely preserved--to the degree that the effect I mentioned becomes audible. And this is attainable nowadays if the loudspeakers are chosen carefully (or dialed-in carefully via DSP), and minimal acoustic treatment applied in-room.

I've found that three conditions must be true for waveform fidelity to be preserved enough for this psychoacoustic effect to occur:
  1. Full-range directivity control of the loudspeakers--down to the room's Schroeder frequency, below which directivity is no longer perceived in small listening rooms (a rough estimate of that frequency is sketched after this list)

  2. Flat phase and amplitude response of the loudspeakers

  3. Control of early reflections just around the loudspeakers and the listening positions, including floor bounce.
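
As referenced in item 1, here is the usual rule-of-thumb estimate of the Schroeder transition frequency, with hypothetical room numbers plugged in:

```python
import math

def schroeder_frequency(rt60_s: float, volume_m3: float) -> float:
    """Rule-of-thumb transition frequency: f_s ~= 2000 * sqrt(RT60 / V)."""
    return 2000.0 * math.sqrt(rt60_s / volume_m3)

# Hypothetical listening room: 6 m x 5 m x 2.5 m with an RT60 of 0.3 s.
volume = 6 * 5 * 2.5                                  # 75 m^3
print(f"{schroeder_frequency(0.3, volume):.0f} Hz")   # ~126 Hz
```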
If you read the references I linked above, you will see links to the Greenfield and Hawksford JAES article on the audibility of loudspeaker phase distortion. It demonstrates that phase flattening is audible - and also how difficult the effect is to hear if the source music has not preserved phase fidelity, the loudspeakers lack full-range directivity control, or early reflections are not sufficiently controlled in-room. Even though the authors apparently did not fully control these areas in their experiment, some of the observers were still clearly able to pick out the flattened-phase cases.

[Attached image: celestion-d700.jpg]


Chris
 
Coming from a background in physical acoustics and engineering - though not specifically in audio - I’ve been wondering why high-fidelity audio still seems so focused on signal accuracy - "waveform fidelity," for lack of a better term. By signal, I mean what microphones capture and what gets delivered near the ears.

To be fair, modern high-fidelity systems already account for a lot of the basics: that we hear with two ears, the limits of human hearing in frequency and dynamic range, how we perceive loudness, acceptable levels of various distortions, and so on. Perceptual insights have clearly shaped things like lossy compression, spatial audio, and room acoustics. But beyond that, I haven’t seen much that really engages with the more subtle ways we actually perceive sound.

Take harmonics, for example. Real-world sounds aren’t just single tones - harmonics are a universal property of oscillations, and any hearing system that evolved to make sense of sound likely developed around this structure. From very early on, brains - ours and those of many other animals - are wired to hear harmonically related frequencies as one sound, or as belonging together. It plays a huge role in how we recognize voices, distinguish instruments, and perceive musical harmony. So I can’t help but wonder - could that be used more directly in how we design audio systems?

I’d be curious to learn if there are any interesting efforts along those lines, or what the main challenges are.

Thanks.
ASR member JJ Johnson has given lectures on the topic at the Seattle AES chapter; they are on YouTube under DansoundSeattle. I think, as you say, the main use is compression, and much of that is based on psychoacoustic masking. ASR also has a dedicated area: https://www.audiosciencereview.com/forum/index.php?forums/psychoacoustics-science-of-how-we-hear.20/ . You never know what you might find in a patent search.
 
But the ear and brain seem to have a pretty specific way of analyzing sound, shaped by millions of years of evolution. Our auditory system is tuned to pay close attention and pick up on particular relationships between frequencies. Otherwise the harmony/music would not be possible.

Floyd Toole is talking about the measurement of speakers in a room and he says:
Two ears and a brain are massively more analytical and adaptable than an omnidirectional microphone and an analyzer.
And...
(Automatic room correction) can yield improvements at low frequencies for a single listener, but above the transition frequency to claim that a smoothed steady-state room curve derived from an omnidirectional microphone is an adequate substitute for the timbral and spatial perceptions of two ears and a brain is absurd.
With your background you might enjoy Dr. Toole's book: Sound Reproduction: The Acoustics and Psychoacoustics of Loudspeakers and Rooms. (A new revision has been announced, so you might want to pre-order or hold off.)

Or Ethan Winer has a book called The Audio Expert which covers a broad spectrum of audio reproduction topics. He's also got a little article on his website called Audiophoolery where he talks about the characteristics of sound quality.

The goal of high fidelity is to accurately reproduce the soundwaves... within the limits of our hearing. Then it doesn't matter what the ear & brain are doing. "Audiophiles" tend to over-estimate, and over-imagine, what we can hear and that's one reason blind listening tests can be useful.

BTW - Phase is one of those things that turns out to be not that important, except when the same frequencies are combined in-and-out of phase with each other. For example, if you reverse the + & - connections to one speaker to get a phase/polarity inversion, you get almost complete cancellation of the bass and some other "phase weirdness". But if you flip the connections to both speakers they are back in phase with each other and everything sounds normal again.
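
A toy illustration of that polarity point, assuming identical bass content in both channels and ignoring room effects (which in reality make the cancellation position- and frequency-dependent):

```python
import numpy as np

fs = 48000
t = np.arange(fs) / fs
bass = np.sin(2 * np.pi * 60 * t)     # 60 Hz content, identical in both channels

left, right = bass.copy(), bass.copy()

def rms(x):
    return np.sqrt(np.mean(x ** 2))

# Both speakers wired correctly: bass sums constructively at the listener.
print(rms(left + right))          # ~1.41 (doubled amplitude)

# One speaker with + and - swapped: that channel is polarity-inverted.
print(rms(left + (-right)))       # ~0.0  (the shared bass cancels)

# Both speakers flipped: back in phase with each other, sounds normal again.
print(rms((-left) + (-right)))    # ~1.41
```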

As you may know, the mixing of direct soundwaves with reflected and delayed soundwaves and the associated phase shifts are a major issue in room acoustics. And it turns out to be a major issue with small* rooms, especially in the bass range where you get out-of-phase subtraction/cancellation at certain frequencies and certain places in the room, and in-phase summation at different frequencies and locations. But that's not really a reproduction problem... You have the same issues with a live musical instrument or a speaker.

* By "small", I mean any room in a house which is much smaller than a live performance space.
 
Sounds like you're making a case for phase accurate speakers. There are a few companies that go to a lot of trouble to keep their speakers phase accurate, which means the harmonics will coincide with the fundamental in time the way they were recorded. Most speakers aren't phase accurate except at the crossover point. That means the harmonics may come early or late. Some claim this is inaudible to humans. The argument I've heard made for it is that our brains are good at reassembling what we hear into coherent sound, so we can't exactly hear the difference, but our brains don't have to work as hard to fix the signal if the timing is correct to begin with. This can, for some people, make phase correct speakers more relaxing. I lean towards thinking there's some truth to it.
Find a piano or a keyboard. Play your favorite chord 10 times in a row as identically as you can. Same attack, same timing, same sustain. It should sound remarkably similar.
Now realize that every time you played that chord you heard a "coherent" result, but the relative phases between your notes were COMPLETELY random because you don't have enough control over the timing to get the same result twice. If our brains were sensitive to the type of phase distortion you're concerned with, then the relative phase of the notes in a chord would make the chord sound different every time we heard it, and songs would be unpredictable because musicians would be incapable of consistently generating the sound they want.
 
In the examples I gave above, the "waveform" IS largely preserved--to the degree that the effect I mentioned becomes audible. And this is attainable nowadays if the loudspeakers are chosen carefully (or dialed-in carefully via DSP), and minimal acoustic treatment applied in-room.
...
Yes, I understand what you mean, but isn’t it the case that in a small room - not a concert hall - even with perfect electronics and speakers, the waveform only stays intact for a few milliseconds before reflections start mixing in? That clean window is really short, but speech and music stretch over much longer times. So after those first few milliseconds, we’re already hearing something pretty different at the ears.

Sure, the stronger the direct sound, the smaller the effect, but even a small amount of reflection can start to alter phase in a frequency-dependent way. And yet we still perceive the sound as stable. So what’s actually being preserved? Somehow the brain still pulls something useful out of it.

I think I get why speaker phase distortion is audible. The room will always do its thing, but maybe we rely more on that initial clean bit. If that part is already messed up - like with driver phase misalignment - maybe the added smearing from the room just pushes perception over the edge. So maybe it's not just that phase distortion is audible in general, but that it messes with the one reliable moment we have before the chaos starts.

I also wonder if the kind of distortion matters. Room effects like reflections, comb filtering, modes, reverb, decay, etc. are at least physically "natural". But what electronics do - take FIR filters, steep crossovers - feels less so. Maybe the brain is just better at untangling "natural" distortions than the kind of artifacts that come from "less natural" signal manipulation.
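
To put a rough number on that "clean window": a simple mirror-image estimate of when the first side-wall reflection arrives relative to the direct sound, with hypothetical positions. A couple of milliseconds of unspoiled direct sound is indeed short compared with the duration of musical notes.

```python
import math

C = 343.0  # speed of sound in m/s at room temperature

def reflection_delay_ms(direct_m: float, reflected_m: float) -> float:
    """Extra arrival time of a reflection relative to the direct sound."""
    return 1000.0 * (reflected_m - direct_m) / C

# Hypothetical layout: speaker and listener both 1.2 m from a side wall, 3 m apart.
direct = 3.0
# Path via the side wall, using the mirror image of the speaker behind the wall.
reflected = math.hypot(3.0, 2 * 1.2)

print(f"{reflection_delay_ms(direct, reflected):.1f} ms")   # ~2.5 ms of clean direct sound
```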
 
Coming from a background in physical acoustics and engineering - though not specifically in audio - I’ve been wondering why high-fidelity audio still seems so focused on signal accuracy - "waveform fidelity," for lack of a better term. By signal, I mean what microphones capture and what gets delivered near the ears.

To be fair, modern high-fidelity systems already account for a lot of the basics: that we hear with two ears, the limits of human hearing in frequency and dynamic range, how we perceive loudness, acceptable levels of various distortions, and so on. Perceptual insights have clearly shaped things like lossy compression, spatial audio, and room acoustics. But beyond that, I haven’t seen much that really engages with the more subtle ways we actually perceive sound.

Take harmonics, for example. Real-world sounds aren’t just single tones - harmonics are a universal property of oscillations, and any hearing system that evolved to make sense of sound likely developed around this structure. From very early on, brains - ours and those of many other animals - are wired to hear harmonically related frequencies as one sound, or as belonging together. It plays a huge role in how we recognize voices, distinguish instruments, and perceive musical harmony. So I can’t help but wonder - could that be used more directly in how we design audio systems?

I’d be curious to learn if there are any interesting efforts along those lines, or what the main challenges are.

Thanks.
In general, our audio systems capture all the relevant harmonics on the recording side, and play them back with sufficient accuracy on the reproduction side that there are no issues.

However, the noise reduction algorithm I developed differs from the industry standard spectral subtraction in a number of ways. One of those ways is that I pay attention to harmonic relationships. If I find a frequency bin in the magnitude range where it's difficult to tell if it's noise or signal (at or just above the noise floor estimate), I check to see if it's a harmonic of a lower frequency tone that significantly exceeds the noise floor. If it is, then I reduce it to a lesser degree than if it wasn't harmonically related to any tones in the music at that point in time. This allows my algorithm to reduce the noise further without removing the harmonic content that contributes to the timbre of the instruments. The result is a more natural sound. Of course, the downside is that my algorithm, with its various improvements, achieves only a slightly better result and takes 100 times longer to do it.
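
To make that decision rule concrete, here is a rough sketch of the harmonic check as described above - the function name, thresholds, and gain values are placeholders for illustration, not the actual implementation. In use, something like this would be evaluated per STFT frame to build a gain mask that is applied before the inverse transform.

```python
import numpy as np

def attenuation_for_bin(k, magnitudes, noise_floor,
                        margin_db=6.0, full_cut=0.1, gentle_cut=0.5):
    """Return a gain (0..1) for FFT bin k of one analysis frame.

    Bins well above the noise-floor estimate are left alone.  Clearly noisy
    bins are cut hard.  Ambiguous bins (at or just above the floor) are also
    cut hard, unless they sit at an integer multiple of a lower-frequency bin
    that clearly exceeds its own floor, in which case they are cut gently to
    preserve harmonic content and timbre.
    """
    mag_db = 20 * np.log10(magnitudes[k] + 1e-12)
    floor_db = 20 * np.log10(noise_floor[k] + 1e-12)

    if mag_db > floor_db + margin_db:
        return 1.0                    # clearly signal: keep as-is
    if mag_db < floor_db:
        return full_cut               # clearly noise: strong reduction

    # Ambiguous bin: check k/2, k/3, ... k/8 as possible fundamentals.
    for divisor in range(2, 9):
        f = round(k / divisor)
        if f < 1:
            break
        f_db = 20 * np.log10(magnitudes[f] + 1e-12)
        f_floor_db = 20 * np.log10(noise_floor[f] + 1e-12)
        if f_db > f_floor_db + margin_db and abs(f * divisor - k) <= 1:
            return gentle_cut         # plausible harmonic of a real tone
    return full_cut                   # no harmonic support: treat as noise
```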
 
If I find a frequency bin in the magnitude range where it's difficult to tell if it's noise or signal (at or just above the noise floor estimate), I check to see if it's a harmonic of a lower frequency tone that significantly exceeds the noise floor. If it is, then I reduce it to a lesser degree than if it wasn't harmonically related to any tones in the music at that point in time. This allows my algorithm to reduce the noise further without removing the harmonic content that contributes to the timbre of the instruments. The result is a more natural sound. Of course, the downside is that my algorithm, with its various improvements achieves a slightly better result, and takes 100 times longer to do it.
Awesome — this is exactly the kind of improvement I hoped would be possible! Yes, I get that the modest improvement burns a lot more CPU cycles, but as cycles keep getting cheaper — and shorter in duration — it makes sense to keep doing more of these nuanced tricks. I also assume this approach could result in better lossy compression as well?

I wonder if this could also work for probing stronger lower harmonics, rather than just looking for a possible fundamental when it's “naturally” weak or missing. For example, if you’re analyzing a 3kHz component, could it make sense to check whether there's a strong component at 2kHz, regardless of whether the fundamental at 1kHz is present? I mean, assuming the effective frequency resolution of the bins allows for it.

I realize this kind of probing could quickly multiply the combinations to check — and burn even more cycles — but it feels like a promising direction if the goal is perceptual realism over strict signal preservation.
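
One cheap way to test whether, say, 2 kHz and 3 kHz components share a fundamental even when 1 kHz itself is weak or missing is to search for an approximate common divisor of the candidate frequencies. A sketch under that assumption (a real implementation inside a spectral framework would also weight candidates by bin magnitude):

```python
def common_fundamental(freqs_hz, f_min=50, f_max=1000, tol=0.03):
    """Find the candidate fundamental (in whole Hz) whose integer multiples
    best match every frequency in freqs_hz; None if nothing fits within tol."""
    best_f, best_err = None, float("inf")
    for f in range(f_max, f_min - 1, -1):          # scan downward: prefer higher fits
        errors = [abs(x - round(x / f) * f) / x for x in freqs_hz]
        if all(e < tol for e in errors) and sum(errors) < best_err:
            best_f, best_err = f, sum(errors)
    return best_f

# 2 kHz and 3 kHz line up as the 2nd and 3rd harmonics of a (possibly absent) 1 kHz.
print(common_fundamental([2000.0, 3000.0]))        # 1000
```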
 
... I’ve been wondering why high-fidelity audio still seems so focused on signal accuracy - "waveform fidelity," for lack of a better term. By signal, I mean what microphones capture ...
You're really asking about the hi-fi fans. It's not done that way in the recording studio; only the adverts for playback devices ask this question - for understandable reasons, but certainly not to improve the world.

To be fair, modern high-fidelity systems already account for a lot of the basics: ... I haven’t seen much that really engages with the more subtle ways we actually perceive sound.
So you've brought me back once again. I'm off again in a minute. I sincerely hope that the universities don't spend my tax money on hi-fi research. We have enough people with hearing problems in everyday life. I'm happy to pay for help for them.

Take harmonics, for example. ... So I can’t help but wonder - could that be used more directly in how we design audio systems?
...
I’d be curious to learn if there are any interesting efforts along those lines, or what the main challenges are.
It's not clear to me where there could be any improvements!

In general, the production chain of hi-fi consumer recordings is not considered as a whole. It gets shortened to holding up a microphone and playing the result back at home. That picture is oversimplified, but it is easy to understand, and that is how the idea of an unaltered waveform catches on. Your question can't really be asked within that framing.

I personally assume that I have to be subjectively active in order to understand a recording, just like I read a book. By this I mean that the design of a recording in all the parameters of its composition must allow me to extract meaning from it. The playback technology only plays a secondary role, the first role is played by my mind. I'll leave it at that.

Awesome — this is exactly the kind of improvement I hoped would be possible!
Well, maybe this can be added to the MP3 standard. However, I don't think that digging through the noise in search of a treasure will make it any easier to understand a recording.

***
I fear that there is also a misunderstanding here; could it be that in (am.) English the word “science” is also used for engineering?

The engineer uses known information under strict rules. The scientist scrutinises with subtle logic, which is a huge difference not only in (europ.) linguistic usage. So long.
 
I fear that there is also a misunderstanding here; could it be that in (am.) English the word “science” is also used for engineering?

The engineer uses known information under strict rules. The scientist scrutinises with subtle logic, which is a huge difference not only in (europ.) linguistic usage.
1. Real engineers don’t need induction or deduction - they need production.
2. Science is the art of satisfying private curiosity on the taxpayer’s dime.
 