Who is receiving sound "that is phase and time aligned"?
If you are talking solely about direct sound, you are talking about inner phase relations or group delay (that is how I understand the comment). Phase and delay differences need at least two reference points. And if we look at what is audible under lab conditions regarding group delay, and at which phase shifts existing gear actually causes, my guess would be that very low, artificial bass events (like bass drum sounds or techno beats) with a very distinct time relation to their harmonics (the 'click' part of the sound) are the most likely to show audible differences.
humans have the cognitive ability to separate direct sound from summed-with-reflections sound, and will perceive the sound as erroneous if the direct sound from the speakers is not well-balanced, even if the summed sound is not badly balanced. Therefore, getting the direct (anechoic) speaker response to sound well-balanced is important.
While I fully agree with Dr. Toole's claim, I do not think it is sufficient to have only the direct sound balanced. Once our brain has a tonal reference for the timbre (the direct sound), it is pretty likely to recognize spectral deviations in the early reflections as well as in the diffuse, later reverb. That's why we can easily tell binaural recordings from a glass hall apart from those made in a wood-paneled concert hall. I have been in a diffuse-field simulator that creates such reverb artificially, and it is pretty astonishing what our brain can understand about the room's properties just from changes in the initial delay and the tonality.
humans very strongly prefer the sound of a classical music performance in a hall with good acoustics more than outdoors.
Which non-amplified classical performances have you attended under free-field conditions? I have attended quite a few, including operas and oratorio performances. I cannot say that people necessarily prefer halls; it is just very different: less loud, timbrally thin, and close and distant at the same time. Instruments and voices that are heavy on middle frequencies and depend on early reflections seem to have the most problems with this; during general rehearsals I noticed first and foremost contraltos, tenors, horns and clarinets struggling. It is impossible to do a recording of this btw, as it would not tell us anything about the room.
I want that acoustic energy to exactly match the electrical signal given to the speaker.
The transducer properties will render this impossible. Turning a signal into sound waves is not a mathematical process. You can only approximate the electrical signal and choose which alterations are acceptable.
we never had the ability to make such phase and time aligned technically excellent speakers before. Multi-way active DSP has truly changed what speakers are capable of.
Phase and time alignment at one single point does not necessarily help; it has to be even over a greater window, both for direct sound and early reflections.
A few speakers can come pretty close to that, if they employ a DSP crossover, FIR filters with a linear-phase mode, and sufficiently large drivers/low crossover points. They have been around for something like 20 years. I encourage everyone to listen to such a unit, if possible switching between linear-phase and minimum-phase mode. Differences are not huge, and most obvious in the lower bass with artificial sounds like EDM. Interestingly, these bass differences survive even horrendous room-induced alterations in a listening test, in my experience.
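To illustrate what a linear-phase FIR mode means in practice, here is a minimal sketch. It assumes a Hann-windowed sinc low-pass with 101 taps at 48 kHz; those numbers and the function names are made up for illustration and are not taken from any particular speaker. The point is only that symmetric taps delay every frequency by the same amount, (N-1)/2 samples, so the time relation between a bass fundamental and its harmonics is left untouched.

```python
import cmath
import math


def windowed_sinc_lowpass(num_taps, cutoff_hz, fs_hz):
    """Symmetric (and therefore linear-phase) low-pass FIR: Hann-windowed sinc.

    Illustrative design only; num_taps should be odd.
    """
    mid = (num_taps - 1) / 2.0
    fc = cutoff_hz / fs_hz  # normalized cutoff in cycles per sample
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * t) / (math.pi * t)
        window = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def response(taps, f_hz, fs_hz):
    """Complex frequency response of the FIR at f_hz."""
    w = 2.0 * math.pi * f_hz / fs_hz
    return sum(h * cmath.exp(-1j * w * n) for n, h in enumerate(taps))


def pure_delay_ms(taps, fs_hz):
    """Delay of a symmetric FIR: (N-1)/2 samples, identical at all frequencies."""
    return (len(taps) - 1) / 2.0 / fs_hz * 1000.0


def residual_imag(taps, f_hz, fs_hz):
    """Imaginary part left after removing the pure delay.

    For exactly linear phase this is zero (up to rounding), i.e. nothing
    but the constant delay contributes to the phase.
    """
    w = 2.0 * math.pi * f_hz / fs_hz
    mid = (len(taps) - 1) / 2.0
    return (response(taps, f_hz, fs_hz) * cmath.exp(1j * w * mid)).imag


taps = windowed_sinc_lowpass(101, 500.0, 48000.0)
print("constant delay in ms:", pure_delay_ms(taps, 48000.0))
for f in (50.0, 200.0, 1000.0):
    print(f, "Hz residual:", residual_imag(taps, f, 48000.0))
```

With these assumed values the filter delays everything by 50 samples, a bit over 1 ms, and the residuals come out at rounding-error level. A minimum-phase filter with the same magnitude response would not pass this check.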
a) the direct sound should be time-coherent north of about 500 Hz in order for the overtones to arrive simultaneously, resulting in these peaks;
In theory I agree, but in practice the necessary coherence above 500 Hz is not that difficult to achieve. At 500 Hz, we are not very sensitive to time issues, even if they pose an interaural difference, to which our brain is more sensitive. Try to localize a male singer singing a long 'O' or 'A' without consonants! Pretty difficult; we are used to localizing singers by bright vowels or brilliance-heavy consonants.
(Elsewhere Griesinger mentions 700 Hz and 1000 Hz as being the lower end of the frequency region that matters most to the ears, giving me the impression that the octave between 500 Hz and 1 kHz is a somewhat fuzzy transition region.)
Absolutely correct regarding audibility of time differences. This band (400-800 Hz or 500 Hz-1 kHz) is, on the other hand, very important for tonality, as our brain seems to be pretty sensitive to level differences between overtones.
Group delay is even more likely to be inaudible below 1 kHz, and above that, a loudspeaker designer needs to do pretty strange things to cause linear distortion in the range of milliseconds.
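For a rough feel of the orders of magnitude, here is a minimal sketch of the group delay a conventional crossover adds. It assumes a 4th-order Linkwitz-Riley (LR4) crossover, which is my choice for illustration and not something stated above; the function names are hypothetical. The summed low-pass and high-pass outputs of an LR4 form a 2nd-order all-pass, so the magnitude is flat and only the phase (and with it the group delay) changes.

```python
import math


def lr4_sum_phase(f_hz, fc_hz):
    """Phase in radians of the summed LR4 low-pass + high-pass output.

    The sum is the all-pass H(s) = D(-s) / D(s) with D(s) = s^2 + sqrt(2) s + 1
    (s normalized to the crossover frequency), so arg H = -2 * arg D(jx).
    arg D stays within [0, pi], so no phase unwrapping is needed.
    """
    x = f_hz / fc_hz
    return -2.0 * math.atan2(math.sqrt(2.0) * x, 1.0 - x * x)


def group_delay_ms(f_hz, fc_hz, df_hz=0.01):
    """Group delay -d(phase)/d(omega) in milliseconds, by finite difference."""
    lo = max(f_hz - df_hz, 0.0)
    hi = f_hz + df_hz
    dphi = lr4_sum_phase(hi, fc_hz) - lr4_sum_phase(lo, fc_hz)
    domega = 2.0 * math.pi * (hi - lo)
    return -dphi / domega * 1000.0


# Group delay at the crossover frequency for three assumed crossover points.
for fc in (80.0, 500.0, 2000.0):
    print(fc, "Hz crossover:", round(group_delay_ms(fc, fc), 3), "ms")
```

Under these assumptions the delay at the crossover frequency is roughly 5.6 ms for an 80 Hz crossover, 0.9 ms at 500 Hz and 0.23 ms at 2 kHz. It scales inversely with the crossover frequency, which fits the observation that millisecond-range effects sit in the bass.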