A bit of all of that. The way I see it is this:
What reaches our ears is 'a signal', a wiggly line of pressure against time, but our brains decode it into the original acoustic 'objects', and at a higher level they also follow progressions of rhythm and melody, etc.
The brain could be mistaken about the acoustic objects: what sounds like two people singing could really be four, with one of the singers in exact antiphase to one of the others so that the pair cancel each other out at the microphone - but this is very unlikely! Or maybe the vocalists are adept at singing phonemes in clever rotation, chopping the words up between them. Anything is possible, but in reality our brains use logic, familiarity and continuity to decode the signal back into the most likely integral objects.
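To make the antiphase case concrete, here is a rough Python sketch. It is only an illustration under simplifying assumptions: pure sine tones stand in for the voices, the 'microphone' is just a sample-by-sample sum, and all the names (`tone`, `mix`, `singer_a` and so on) and the chosen frequencies are made up for the example.

```python
import math

SAMPLE_RATE = 8000  # samples per second (arbitrary choice for this sketch)

def tone(freq, n, phase=0.0):
    """Return n samples of a unit-amplitude sine wave (a stand-in for a voice)."""
    return [math.sin(2 * math.pi * freq * i / SAMPLE_RATE + phase) for i in range(n)]

def mix(*voices):
    """Sum the voices sample by sample, as a single idealised microphone would."""
    return [sum(samples) for samples in zip(*voices)]

def peak(signal):
    """Largest absolute sample value in the signal."""
    return max(abs(s) for s in signal)

n = SAMPLE_RATE // 10                      # a tenth of a second of audio
singer_a = tone(220.0, n)
singer_b = tone(220.0, n, phase=math.pi)   # exact antiphase to singer_a
singer_c = tone(330.0, n)
singer_d = tone(440.0, n)

four_singers = mix(singer_a, singer_b, singer_c, singer_d)
two_singers = mix(singer_c, singer_d)

# How far apart are the two mixes, at worst? Only floating-point rounding.
difference = peak([x - y for x, y in zip(four_singers, two_singers)])
print(f"peak of two-singer mix: {peak(two_singers):.3f}")
print(f"largest difference between the mixes: {difference:.2e}")
```

The four-singer signal matches the two-singer one to within rounding error, so nothing in 'the signal' can tell the two situations apart; the brain just picks the far more plausible one.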
Each object has its own timbre, rather than 'the signal' having timbre.
Without knowing about the nature of microphones, cables, recording systems and speakers, human listeners would never think in terms of 'the signal', but only in terms of the objects. An orchestra conductor would never say "I need more mid range!"; he would say "I need more violins! And less of the trumpets". He wouldn't say to the whole orchestra "I don't like your timbre...", but he might say "The trombone is too strident...".
But poor reproduction (errors in frequency response, distortion, phase and timing) means that in audio, the acoustic objects are often blurred together in *meaningless* ways, artefacts of the mechanics and electronics that have nothing to do with real sound and acoustics. And this, coupled with knowledge about the mechanics of audio reproduction, is where the audiophile makes the transition from objects to 'signal'.
Once the separation between the objects is degraded sufficiently, all that is left is to adjust the 'colour', 'flavour' or 'timbre' of the stream of lumpy audio paste - and people can actually enjoy listening to it, but it is a different mode of listening from following musical 'objects' in the live situation. The higher level stuff (melody etc.) probably remains intact, but some of the lower level complexity and/or simplicity is gone, replaced with a synthetic, artificial, uniform substitute.
As this substitute is regarded as 'high end', there is no reason for anyone to strive for anything better.