This is a reasonable question. The problem is that it takes double-blind controlled tests to determine what is required to "properly set up" a system. People have made assumptions many times over the years, sometimes writing them into standards, only to learn that there is more to the story.
A systematic scientific investigation of loudspeaker-based sound reproduction involves several separable stages, such as:
1. the sound source (loudspeakers and detection thresholds for resonances and non-linear distortions within them),
2. loudspeaker directivity and the interaction with adjacent reflective boundaries: low-frequency sound power boundary interactions and higher-frequency specular reflections,
3. the wavelength-determined low-frequency resonances in small rooms that modify timbre and that create loudspeaker and listener location dependencies,
4. the number of channels and capture, storage and reproduction algorithms necessary to deliver more than mere sound quality. What elements of direction and space are involved in making such a decision?
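To make item 3 concrete, here is a minimal sketch of how wavelength ties low-frequency resonances to room dimensions. It uses the textbook formula for an idealized, rigid-walled rectangular room, which is an assumption on my part for illustration; real rooms with flexible walls, openings and furnishings will deviate. The function name `room_modes`, the example dimensions and the speed of sound value are all hypothetical choices, not measurements.

```python
import math
from itertools import product

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 °C


def room_modes(lx, ly, lz, max_order=2, c=SPEED_OF_SOUND):
    """Return sorted (frequency_hz, (nx, ny, nz)) tuples for an idealized
    rigid-walled rectangular room with dimensions lx, ly, lz in metres.

    f = (c / 2) * sqrt((nx/lx)^2 + (ny/ly)^2 + (nz/lz)^2)
    """
    modes = []
    for nx, ny, nz in product(range(max_order + 1), repeat=3):
        if nx == ny == nz == 0:
            continue  # (0, 0, 0) is not a resonance
        f = (c / 2.0) * math.sqrt(
            (nx / lx) ** 2 + (ny / ly) ** 2 + (nz / lz) ** 2
        )
        modes.append((f, (nx, ny, nz)))
    return sorted(modes)


if __name__ == "__main__":
    # Hypothetical 5 m x 4 m x 2.5 m room; print the six lowest modes.
    for f, idx in room_modes(5.0, 4.0, 2.5)[:6]:
        print(f"{f:6.1f} Hz  mode {idx}")
```

The lowest modes fall well inside the bass range for domestic room sizes, which is why loudspeaker and listener positions matter so much there. This predicts only where resonances lie in frequency, not how audible they are at a given seat.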
These are the large variables, not subtleties, and in the process of examining them, one inevitably learns much about what is audible. It is a process of chasing diminishing returns. When these identified audible problems are minimized, the question then is: what do people consider to be "good"? There may be an element of personal preference. When nobody complains, can one assume that the system is "properly set up"? But then the question is: if they did complain, is it the program (an infinitely variable, non-standardized quantity) or the playback system? Are there compensating errors?
When we have conducted experiments in "personal choice", using highly rated loudspeakers set up reasonably in a room and allowing listeners to freely adjust bass and treble to please themselves, there can be huge variations. Program, as it must be, is a variable, and because bass extension and level commonly vary between programs, the largest personal variations were in bass level. Inexperienced listeners occasionally tended to prefer a lot of bass boost. Were they bass-deprived in their normal listening? Experienced and trained listeners were much more moderate, preferring a more transparent, neutral balance in playback. But in multiple-loudspeaker double-blind evaluations all listeners prefer loudspeakers without resonances. That appears to be a necessary starting point, but beyond that, preferences in spectral balance might differ, and certainly will differ because of variations in programs. The forum discussions about varied preferences in "room curves" (a result, not a target, I will add) are proof of a kind.
So many variables, in addition to susceptible, adaptable and occasionally capricious humans.
A separate, and potentially more rewarding, method is binaural (head-related) recording and reproduction. However, it is antisocial. The present popularity of headphone listening (it is the dominant sound reproduction method at the moment) is interesting because, in effect, all recordings are created for stereo loudspeaker reproduction, and through headphones the perceptions are very different, and not at all what was intended. The convenience of portability surpasses any need for reality. So, yes, humans are complicated.