You have understood the first part: a flat frequency response and adequate bandwidth are basic necessities for reproducing the pressure waveform created when sounds from multiple instruments combine. All the information is in the pressure waveform, as our own ears tell us. This is still a bit simplified, because the rule is true only for minimum-phase devices: microphones, loudspeaker transducers, and amplifiers. Phase shift can distort the waveforms, but it turns out that humans don't hear it, especially when listening in normal rooms. We don't hear "accurate" waveforms in reflective spaces such as normal listening rooms or concert halls.
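To make the phase point concrete, here is a minimal sketch. It is an illustration only, not a model of any real device: the test signal (a fundamental plus a third harmonic at one third the amplitude) and the 90-degree shift applied to the harmonic are assumptions chosen for clarity. It shows that the waveform shape changes while the magnitude spectrum stays the same.

```python
import math

N = 1000  # samples in one period of the fundamental (assumed for illustration)

def waveform(phase3):
    """One period of a fundamental plus a third harmonic at 1/3 amplitude.

    phase3 is the phase offset (radians) applied to the third harmonic only.
    """
    return [
        math.sin(2 * math.pi * n / N)
        + (1 / 3) * math.sin(2 * math.pi * 3 * n / N + phase3)
        for n in range(N)
    ]

def harmonic_magnitude(x, k):
    """Amplitude of the k-th harmonic, computed with a single-bin DFT."""
    re = sum(v * math.cos(2 * math.pi * k * n / N) for n, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * k * n / N) for n, v in enumerate(x))
    return 2 * math.hypot(re, im) / N

original = waveform(0.0)
shifted = waveform(math.pi / 2)  # hypothetical 90-degree shift of the harmonic

# Same harmonic amplitudes before and after the phase shift ...
mags_original = [harmonic_magnitude(original, k) for k in (1, 3)]
mags_shifted = [harmonic_magnitude(shifted, k) for k in (1, 3)]

# ... but a visibly different waveform, seen here as a different peak value.
peak_original = max(abs(v) for v in original)
peak_shifted = max(abs(v) for v in shifted)

print("harmonic amplitudes:", mags_original, mags_shifted)
print("waveform peaks:", peak_original, peak_shifted)
```

With these assumed values the peak moves from roughly 0.94 to roughly 1.23 while both harmonic amplitudes are unchanged, so the waveform on a scope looks different even though the frequency response is identical.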
Interesting that you would buy electronics based on specifications, which are frequently biased, but not loudspeakers. We now have meaningful measurements, which Amir, Erin, and others publish to the great benefit of all serious audiophiles. I would trust that data more than my own ears in an uncontrolled listening situation.
What you next need to appreciate is how two ears and a brain allow us to perceive direction and space. Yes, you "only can hear 2 waves even with 8 speakers" but the human binaural hearing system allows you to appreciate that there are 8 speakers in the room. Stereo it isn't. Stereo is a basic problem for our industry, and it has become the default format. I dig into the details of this in the upcoming 4th edition of my book, and the insights are very interesting. The phantom images that populate the soundstage between the loudspeakers are not composed of accurate spectra or waveforms: both loudspeakers "talk" to both ears and there is comb filtering, especially noticeable for the featured artist in the centre location. A problem with multichannel audio is that the centre loudspeaker, which is a good idea, sounds different from the phantom images elsewhere on the soundstage. It is a challenge for recording engineers to deal with the centre channel. Humans are very adaptive - and forgiving! However, good immersive multichannel recordings can be remarkably impressive, allowing one to walk around the room and not lose the illusion.
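The comb filtering of a phantom centre image can be sketched with a very simple model. This is a rough illustration under stated assumptions, not a measurement: each ear is taken to receive the sound from the nearer loudspeaker plus the same sound from the farther loudspeaker, delayed and attenuated. The delay of 0.27 ms and the far-ear gain of 0.7 are assumed values, and the function name is hypothetical.

```python
import cmath
import math

DELAY_S = 0.00027  # assumed extra travel time to the far ear, in seconds
FAR_GAIN = 0.7     # assumed attenuation of the far-loudspeaker sound (head shadow simplified)

def phantom_centre_gain_db(freq_hz, delay_s=DELAY_S, far_gain=FAR_GAIN):
    """Level at one ear, relative to a single direct sound, in dB.

    Models the ear signal as direct sound plus a delayed, attenuated copy:
    H(f) = 1 + g * exp(-j * 2 * pi * f * delay)
    """
    h = 1 + far_gain * cmath.exp(-2j * math.pi * freq_hz * delay_s)
    return 20 * math.log10(abs(h))

# The first cancellation occurs where the delay equals half a period.
first_notch_hz = 1 / (2 * DELAY_S)

for f in (200.0, 1000.0, first_notch_hz, 3700.0):
    print(f"{f:8.0f} Hz: {phantom_centre_gain_db(f):+6.2f} dB")
```

With these assumed numbers the first dip falls near 1.85 kHz and is about 10 dB deep in this idealized model. A real centre loudspeaker delivers one direct sound to each ear, so this summation does not occur, which is one reason it can sound different from a phantom centre. Room reflections and real head shadowing will change the details, so the figures are indicative only.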
Headphones, and the related crosstalk-cancelled loudspeaker version, are fundamentally different. This is where 2 ears and 2 channels make sense, but recordings are mixed for loudspeaker stereo, so what is heard in headphones is not what was intended. With technically accurate headphones and well-synchronized head tracking, binaural (dummy head) recordings through headphones can be remarkably like "being there". These technologies are discussed in the upcoming book, and they are serious options.