My question is: how much audio processing does the brain do? Is what our ears pick up actually what we think we hear?
From a technical perspective, the processing done by the human auditory system is very interesting. If you've read about the hairs in the ear (stereocilia), you probably realize that the ear isn't a simple time-domain transducer like a microphone (some people think the eardrum is like a microphone, but that's not its function). But the ear also isn't a frequency-domain transducer; that is, the stereocilia aren't arranged in different lengths with the brain performing the equivalent of an inverse Fourier transform. Rather, they're all roughly the same length, and as sound waves make the basilar membrane move, they all feed into the auditory processing. If you're into signal processing, the closest conceptual thing is probably aperture synthesis, the way several small antennas are combined into one big radio telescope. That might not be as unusual as it sounds, since we know insects whose compound eyes have hundreds of facets kind of do the same thing for visual stimuli, and it's kind of cool that eyes and ears found similar solutions in different species.
IMHO it is really extraordinary how much processing the brain actually does.
To get an intuitive sense of just how complex the processing needs to be, this page has a good animation of one example:
https://auditoryneuroscience.com/acoustics/bm2-isolated-clicks The lower image shows which areas of the basilar membrane move over time (think of the stereocilia as "reading" info along the y axis) just to sense an impulse. Even though the basilar membrane is designed to be sensitive to different frequencies at different places, the frequency range it directly senses is actually fairly narrow (about 400 Hz to 6 kHz), so deriving the entire range of human hearing (20 Hz to 20 kHz) involves a complex analysis that combines signals over the whole length of the membrane over a period of time.
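If you want to play with the "one click, many places" idea yourself, here is a rough sketch in Python. To be clear about the assumptions: it uses a gammatone filter bank, which is a common textbook stand-in for the basilar membrane and not necessarily what the linked page uses; the bandwidth rule is the Glasberg and Moore ERB formula; and the filter order, sample rate, channel frequencies, and function names (erb_hz, gammatone_ir, click_response) are all just picked for illustration.

```python
import numpy as np

def erb_hz(fc_hz):
    # Equivalent rectangular bandwidth (Glasberg & Moore approximation).
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone_ir(fc_hz, fs_hz=48000, dur_s=0.05, order=4):
    # Impulse response of one gammatone channel: a gamma-shaped envelope
    # multiplied by a cosine at the channel's centre frequency.
    t = np.arange(int(dur_s * fs_hz)) / fs_hz
    b = 1.019 * erb_hz(fc_hz)  # assumed bandwidth scaling for this sketch
    env = t ** (order - 1) * np.exp(-2 * np.pi * b * t)
    return t, env, env * np.cos(2 * np.pi * fc_hz * t)

def click_response(centre_freqs_hz, fs_hz=48000, dur_s=0.05):
    # Feed a single click (unit impulse) into every channel and record
    # each channel's output plus the time at which its envelope peaks.
    click = np.zeros(int(dur_s * fs_hz))
    click[0] = 1.0
    out = {}
    for fc in centre_freqs_hz:
        t, env, ir = gammatone_ir(fc, fs_hz, dur_s)
        response = np.convolve(click, ir)[:len(click)]
        peak_ms = 1000.0 * t[np.argmax(env)]
        out[fc] = (response, peak_ms)
    return out

# Hypothetical channel frequencies, chosen only for illustration.
for fc, (_, peak_ms) in click_response([400, 1500, 6000]).items():
    print(f"{fc} Hz channel: envelope peaks at about {peak_ms:.2f} ms")
```

In this toy model the low-frequency channels peak later and ring longer than the high-frequency ones, so even a single click ends up smeared across place and time, which is the kind of pattern the animation shows.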