As an aside, "psychological outcomes" is too loaded a term. Hearing responses and the limitations of hearing are physiological phenomena that can be tested, and this is done in the treatment of many ear issues, for example ear ringing in response to only certain frequencies. Hearing fatigue when exposed to sound (whether volume or frequency related) is also a physiological phenomenon.
I think the valid issue you are pointing to is that these physiological characteristics vary from person to person, so how is one supposed to predict what any given individual will perceive? But that is a matter of individual variation, not some unknowable psychological factor.
First of all, from a science methodology point of view, difficulty in knowing something does not excuse making invalid inferences from observation. So the point about the limits of the inferences we can draw about audibility from measurements still stands. I was not making any point there about how anyone is supposed to compensate for it in equipment or measurement design; that is a different issue altogether.
In science, one does not have to be ready with an alternative in order to invalidate an inference or assign a limitation to it.
Second, while individual characteristics vary, there are probabilistic categorizations we make in science all the time because human characteristics fall into broad categories. For example, what counts as a safe cholesterol or blood sugar level is a probabilistic statement. Weather is extremely difficult to forecast beyond about five days because the number of data points and variables needed is far too large in practice, yet we still make probabilistic forecasts. So techniques exist in science to handle these things; in some cases, more study is needed.
Take, for example, the threshold of noise audibility. We have some general numbers on what noise is audible or not across volume and frequency ranges. These are probabilistic, since individual characteristics vary, but we can still treat a cutoff as applicable to a large band of the population, with that caveat stated. The engineering equivalent of this is saying a DAC is fine as long as it is kept below an output level of x volts.
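To make one reading of that engineering equivalent concrete, here is a minimal sketch that converts a residual output noise voltage into an estimated SPL and compares it to an assumed hearing-threshold figure. The headphone sensitivity and the threshold value are placeholders I am assuming for illustration, not established numbers, and a real check would also have to account for frequency weighting and the spread of individual hearing thresholds.

```python
import math

# Hedged sketch: translating an output noise voltage into an estimated SPL at the
# ear, then comparing it to an assumed hearing-threshold figure. The sensitivity
# and threshold values below are illustrative placeholders, not measured data.

HEADPHONE_SENSITIVITY_DB_SPL_PER_VRMS = 100.0  # assumed: 100 dB SPL at 1 Vrms
ASSUMED_THRESHOLD_DB_SPL = 0.0                 # assumed broadband threshold of hearing

def noise_spl(noise_vrms: float,
              sensitivity_db_per_vrms: float = HEADPHONE_SENSITIVITY_DB_SPL_PER_VRMS) -> float:
    """Estimated SPL produced by a given RMS noise voltage."""
    return sensitivity_db_per_vrms + 20.0 * math.log10(noise_vrms)

def likely_inaudible(noise_vrms: float,
                     threshold_db_spl: float = ASSUMED_THRESHOLD_DB_SPL) -> bool:
    """Population-level, probabilistic statement only: True if the estimated
    noise SPL sits below the assumed threshold for most listeners."""
    return noise_spl(noise_vrms) < threshold_db_spl

print(likely_inaudible(2e-6))   # 2 uV noise  -> roughly -14 dB SPL under these assumptions
print(likely_inaudible(50e-6))  # 50 uV noise -> roughly +14 dB SPL under these assumptions
```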
Measured artifacts beyond that threshold, i.e., already inaudible, should not be included in a metric used to assess deviation from the input if the cumulative "score" can end up being a mix of inaudible artifacts adding up. That is the wrong metric design if one intends to infer audible impact from a measurement. Yes, some qualitative statements are made that certain things are likely inaudible, but still using those in any cumulative metric is problematic (except from an engineering deviation-from-input perspective, where audibility is not the criterion).
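A hedged sketch of that metric-design point: below, a raw cumulative score sums every artifact component (SINAD-style), while an audibility-gated score excludes components already under an assumed inaudibility floor. The component levels and the floor are made-up illustrative numbers, not measurements of any real device.

```python
import math

# Hedged sketch: two ways of aggregating measured artifact components.
# Levels are in dB relative to the signal (dBr); all numbers are illustrative.

artifact_components_dbr = [-110.0, -115.0, -120.0, -125.0]  # hypothetical spurs/noise terms
ASSUMED_AUDIBILITY_FLOOR_DBR = -100.0                        # assumed inaudibility floor (placeholder)

def power_sum_dbr(components_dbr):
    """Sum component powers and express the total in dBr."""
    total_power = sum(10 ** (c / 10.0) for c in components_dbr)
    return 10.0 * math.log10(total_power) if total_power > 0 else float("-inf")

# Raw cumulative score: every component contributes, audible or not.
raw_total = power_sum_dbr(artifact_components_dbr)

# Audibility-gated score: components already below the assumed floor are excluded,
# so inaudible artifacts cannot "add up" into a worse-looking number.
audible_only = [c for c in artifact_components_dbr if c > ASSUMED_AUDIBILITY_FLOOR_DBR]
gated_total = power_sum_dbr(audible_only)

print(f"raw cumulative total:   {raw_total:.1f} dBr")   # about -108.4 dBr
print(f"audibility-gated total: {gated_total:.1f} dBr") # -inf (nothing exceeds the assumed floor)
```

The contrast is the point of the argument above: the raw total looks like a meaningful deviation score even when every individual contributor sits under the assumed audibility floor.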
Unfortunately, indiscriminate reliance on things like SINAD, simply because it is easy for people to consume when put in a comparison table between equipment, is not a valid approach for making inferences about audibility. As an engineering-excellence goal, sure, why not, as long as one recognizes it as such.
The gap that really exists in the science between measurement and audibility is the lack of studies on correlations between qualitative hearing perception and measurable metrics. I have said this before. What would be much more valuable is a correlational table between things like perception of detail, stage, and warmth/brightness on one side and specific measurable quantities on the other. With that you could make probabilistic and useful statements such as: equipment X will likely appeal to people who like Y, or if you liked X you will probably like Y at half the price. That is far more useful than current metrics, but we have a long way to go to get there.
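As a sketch of what such a correlational table could look like computationally, here is a hedged example. The column names are hypothetical, and the data frame would have to come from an actual controlled listening study paired with measurements; no such dataset is being claimed here, so the data loading is left as a stub.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hedged sketch: a correlation table between listener-perception ratings and
# measurable metrics. Column names are hypothetical placeholders.

perception_cols = ["perceived_detail", "perceived_stage", "perceived_warmth"]  # hypothetical ratings
measurement_cols = ["sinad_db", "noise_floor_dbr", "imd_db", "jitter_ps"]      # hypothetical metrics

def perception_measurement_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Spearman rank correlations (with p-values) between each perception rating
    and each measured metric, returned as a table of (rho, p) pairs."""
    rows = {}
    for p in perception_cols:
        rows[p] = {}
        for m in measurement_cols:
            rho, pval = spearmanr(df[p], df[m])
            rows[p][m] = (round(float(rho), 2), round(float(pval), 3))
    return pd.DataFrame(rows).T  # perception attributes as rows, metrics as columns

# Usage, once a dataset pairing listening-test ratings with measurements exists:
#   table = perception_measurement_correlations(df)
#   print(table)
```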
We certainly cannot assume the current measurements are the last word on audibility evaluation, because the correlation between these numbers and audible perception is poor and ill-defined. That cannot simply be dismissed with "if only there were controlled tests, there would be good correlation." That is just copping out. Until then, there will always be a group that rejects these measurements, and for a justifiable reason.