By starting with an acoustically pleasant space (whatever that means) and then playing back via loudspeakers that are intended to mimic the original instruments, positioned however the listener sees fit. Perhaps all of those loudspeakers should be attached to CNC arms, so they can be re-aimed and repositioned arbitrarily.
True optimisation in that direction would resemble a science experiment, but I suspect the results could be very good.
In this mode, having the acoustic image localise to a loudspeaker isn't a problem. In fact, it's intended: one instrument, one playback speaker.
The imaging, then, should be perfect.
There are few serious attempts at this stuff. Here's one:
https://www.moma.org/collection/works/87291
Though I'd suggest that the speakers used just happen to approximate the size and radiation pattern of a human voice, rather than having been chosen specifically for that purpose.
Chris