IME it mostly depends on the duration, frequency and envelope of the signal (no surprise here). On my loudspeaker system, signals with a flat envelope (i.e. sine waves) from 700 Hz up to the top of (my) audible band end up localized completely inside my head, right in the middle (mono signal, no polarity flip, just careful alignment). And it does not matter if I turn my head.
However, signals with a flipped (and not flat) envelope between the channels are localized with varying degrees of proximity, just never inside the head. They usually contain some ILD and ITD information as well. Also, I do not listen in an anechoic environment, so reflections will cause image shifts, for the better. Transients are a completely different story, normally being very short in duration and, more often than not, very complex signals (real music). This creates an illusion of a sound stage, sometimes holographic, depending on the recording.
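For anyone who wants to try the two kinds of test signal described above, here is a rough sketch in Python (standard library only). It is not the exact setup used for the listening described here. The function names, the 48 kHz sample rate and the exponential decay are all assumptions made for illustration, and "flipped" is read as a polarity inversion of one channel, which may not be the only way to interpret it.

```python
import math
import struct
import wave

SAMPLE_RATE = 48_000  # Hz; assumed, any common rate works


def steady_sine(freq_hz, seconds, amplitude=0.5):
    """Flat-envelope sine, identical in both channels (mono, no polarity flip)."""
    n = int(SAMPLE_RATE * seconds)
    mono = [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]
    return [(s, s) for s in mono]  # (left, right) frames


def flipped_burst(freq_hz, seconds, amplitude=0.5):
    """Decaying tone burst (non-flat envelope), right channel polarity-inverted."""
    n = int(SAMPLE_RATE * seconds)
    frames = []
    for i in range(n):
        t = i / SAMPLE_RATE
        envelope = math.exp(-5.0 * t / seconds)  # hypothetical decay shape
        s = amplitude * envelope * math.sin(2 * math.pi * freq_hz * t)
        frames.append((s, -s))
    return frames


def write_wav(path, frames):
    """Write (left, right) float frames in [-1, 1] as 16-bit stereo PCM."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 2 bytes = 16 bit
        wav.setframerate(SAMPLE_RATE)
        data = bytearray()
        for left, right in frames:
            l = int(max(-1.0, min(1.0, left)) * 32767)
            r = int(max(-1.0, min(1.0, right)) * 32767)
            data += struct.pack("<hh", l, r)
        wav.writeframes(bytes(data))
```

Something like `write_wav("sine_1k.wav", steady_sine(1000, 3.0))` and `write_wav("burst_1k.wav", flipped_burst(1000, 0.5))` gives two files to compare over loudspeakers. Keep the playback level low at first, steady sine waves can be hard on tweeters.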
Headphones, on the other hand, never create anything out of the ordinary, just plain and simple "inside the head", except for a handful of processed "3D illusions".
FYI I would highly recommend this (no special introduction required):