Psychoacoustics by Fastl and Zwicker, chapter 4 (masking). Dirac also
published a PDF on room correction (see the section "The Fourier transform and the concept of frequency") that talks about this effect:
"As Dennis Gabor noted in 1945, the Fourier representation of a signal shows significant departures from the human sensation of frequencies". So you don't have to take my word for it; Gabor has a Nobel Prize.
I agree that if you clap your hands in a room you'll hear an echo. That's because the reverb tail in a normal room lasts longer (~500 ms) than the masking effects (~50 ms), but that doesn't mean the psychoacoustic effects don't exist.
The bottom line is that human hearing is a "time-variant" system. However, playing a continuous tone in a room eliminates the time-variant aspects of human hearing! That is because a continuous tone played in a room turns into a steady-state vibration at every point in that room. In this special case, both an FFT and the human ear will process the (steady-state) sound the same way. But when you have normal sounds with temporal variation (i.e., not continuous tones), the human ear will hear them differently from what an FFT shows.
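A toy illustration of that last point, and only that: it is not a model of hearing. The magnitude spectrum of one long FFT window is identical for a tone burst at the start of the window and the same burst at the end (the timing lives only in the phase), whereas any analysis with short time windows tells them apart immediately. The signals, frame length, and function names below are made up for the example, and a plain DFT is used so it runs without extra libraries.

```python
import cmath
import math

def dft_magnitude(x):
    """Magnitude spectrum of the whole signal, i.e. one long analysis window."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def frame_energy(x, frame):
    """Energy in consecutive non-overlapping frames: a crude time-resolved view."""
    return [sum(v * v for v in x[i:i + frame]) for i in range(0, len(x), frame)]

# Hypothetical test signals: a 64-sample sine burst (period 8 samples)
# placed at the start or at the end of 256 samples, the rest being silence.
burst = [math.sin(2 * math.pi * t / 8) for t in range(64)]
early = burst + [0.0] * 192
late = [0.0] * 192 + burst

# The two magnitude spectra agree to within rounding error...
max_diff = max(
    abs(a - b) for a, b in zip(dft_magnitude(early), dft_magnitude(late))
)

# ...while the frame energies show when the burst actually happened.
energy_early = frame_energy(early, 64)
energy_late = frame_energy(late, 64)
print(max_diff, energy_early, energy_late)
```

For a continuous tone the frame energies would be the same in every frame, which is the steady-state case where the long-window view and the time-resolved view stop disagreeing.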
We can simulate how the human ear hears with a set of frequency- and time-dependent filters collectively called an "auditory filter bank". Using this we can (and do) see how the FFT and human perception differ depending on the sound. This is the basis of the statement I bolded in my post.
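A minimal sketch of what such a filter bank can look like, assuming fourth-order gammatone filters with ERB bandwidths (Glasberg and Moore's approximation). That is a common textbook choice, but it is an assumption here: nothing above says which filters are actually used. The sample rate, centre frequencies, test signal, and function names are made up for illustration, and the code is plain Python so it runs without extra libraries.

```python
import math

def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.025, order=4, b=1.019):
    """Impulse response of one gammatone filter centred at fc Hz, peak-normalised."""
    ir = []
    for i in range(int(duration * fs)):
        t = i / fs
        envelope = t ** (order - 1) * math.exp(-2 * math.pi * b * erb(fc) * t)
        ir.append(envelope * math.cos(2 * math.pi * fc * t))
    peak = max(abs(v) for v in ir)
    return [v / peak for v in ir]

def convolve(x, h):
    """Direct-form FIR filtering, output truncated to the length of x."""
    return [
        sum(h[j] * x[i - j] for j in range(min(i + 1, len(h))))
        for i in range(len(x))
    ]

def filter_bank(x, fs, centre_freqs):
    """Run x through one gammatone filter per centre frequency."""
    return {fc: convolve(x, gammatone_ir(fc, fs)) for fc in centre_freqs}

# Hypothetical input: a 50 ms, 1 kHz tone burst followed by 50 ms of silence.
fs = 8000
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(400)] + [0.0] * 400

channels = filter_bank(x, fs, [500, 1000, 2000])
energies = {fc: sum(v * v for v in y) for fc, y in channels.items()}
loudest = max(energies, key=energies.get)
print(loudest, energies)
```

Each channel output is still a function of time, so unlike a single long FFT it shows the onset, the decay after the burst stops, and how much of the burst leaks into neighbouring bands. A real auditory model would add more stages on top of this (for example envelope extraction and masking), which this sketch leaves out.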