Just as calculus resolved Zeno's paradox, the barrier you are hitting isn't necessarily ontological; it’s probably just a technical limitation of current HRTF modelling and DSP.
The reason even Stax playing binaural recordings failed you is, I think, that those recordings utilize a generic dummy head. If the shape of your pinna and ear canal differs significantly from that dummy head, your brain rejects the spatial cues.
Furthermore, the "sound source next to the ear" issue you described is caused by the lack of crosstalk and reflections. In the real world, your left ear hears what’s on your right side with a slight delay and tonal shift. Headphones isolate the channels, destroying the geometric triangulation your brain needs to project sound externally. The Zero:2 likely wins for you because it nails your preferred frequency response, but that varies with ear canal shape (they sound awful to me). A lower acoustic impedance load would provide a more consistent frequency response across users. Right now, buying an IEM is a lot like rolling the dice.
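To make the crosstalk point concrete, here is a rough crossfeed sketch in Python. It is not what any particular product does; the function name, the 0.3 ms delay, the 0.5 gain and the 700 Hz low-pass are all numbers I picked for illustration. It just feeds a delayed, darkened copy of each channel into the opposite ear, which is the "slight delay and tonal shift" described above.

```python
import math

def crossfeed(left, right, sample_rate=44100, delay_ms=0.3,
              gain=0.5, cutoff_hz=700.0):
    """Mix a delayed, low-passed copy of each channel into the other ear.

    All default values are illustrative assumptions, not measured data.
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")

    # Interaural delay expressed in whole samples.
    delay = round(sample_rate * delay_ms / 1000.0)
    # One-pole low-pass coefficient, a crude stand-in for head shadowing.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)

    def shadowed(signal):
        out, state = [], 0.0
        for x in signal:
            state += alpha * (x - state)   # low-pass (tonal shift)
            out.append(gain * state)       # attenuate
        # Prepend silence to delay the copy, then trim to the original length.
        return ([0.0] * delay + out)[:len(signal)]

    from_right = shadowed(right)
    from_left = shadowed(left)
    out_left = [l + c for l, c in zip(left, from_right)]
    out_right = [r + c for r, c in zip(right, from_left)]
    return out_left, out_right
```

Feed it a click on the right channel only and the left output stays silent for the first few samples, then picks up a quieter, duller copy. Real crossfeed and virtualization go well beyond this (reflections, proper HRTFs), so treat it as the bare minimum of the idea.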
I love the virtualization effect of my Creative soundcard. It really does the trick for me, and I wish it worked on Android.
I don't see any reason why eliminating occlusion and adapting the sound to your HRTF with DSP wouldn't result in the perception of soundstage. The sound waves reaching your eardrums would be isomorphic to those from a real source, no matter where they originate.
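For what "adapting the sound to your HRTF with DSP" means in its simplest form, here is a minimal sketch: convolve the source with a left-ear and a right-ear impulse response (HRIR). The function names and the toy impulse responses in the usage note are made up for illustration; a real system would use HRIRs measured or fitted for your own ears.

```python
def convolve(signal, kernel):
    """Plain direct-form convolution of two lists of samples."""
    if not signal or not kernel:
        return []
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono source through a pair of ear impulse responses.

    hrir_left / hrir_right are assumed to be supplied by the caller;
    nothing here measures or personalizes them.
    """
    return convolve(mono, hrir_left), convolve(mono, hrir_right)
```

With a toy pair like hrir_left = [1.0, 0.0, 0.0] and hrir_right = [0.0, 0.0, 0.5], a source comes out immediately and at full level in the left ear, and two samples later at half level in the right, which is the kind of cue the brain reads as "source on the left". The hard part is getting impulse responses that match your own pinna and ear canal, not the convolution itself.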
But I may be wrong. More knowledgeable users may explain why it won't ever work.