It turns out I do.
I looked at it in detail. There are two independent variables that affect the preference rating: the standard deviation of the error curve and the absolute value of the slope of its regression line, both of which contribute negatively to the preference score.
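For concreteness, here is a rough sketch of how those two numbers could be computed from an error curve (measured response minus target, in dB). The fit against log frequency, the band limits, and the coefficients in the score are my assumptions based on what I recall of the published model, so treat this as illustrative rather than the exact implementation:

```python
import numpy as np

def preference_predictors(freq_hz, error_db):
    """Compute the two predictors from an error curve (measurement minus target, in dB).

    Assumes the error curve is already band-limited (I believe the model uses
    roughly 50 Hz to 10 kHz) and that the slope is fit against log10(frequency).
    """
    log_f = np.log10(freq_hz)
    # standard deviation of the error curve
    sd = np.std(error_db)
    # absolute slope of the linear regression line through the error curve
    slope, _ = np.polyfit(log_f, error_db, 1)
    return sd, abs(slope)

def predicted_preference(sd, abs_slope):
    # Coefficients as I remember them from the published model; treat them
    # as an assumption for illustration, not the definitive formula.
    return 114.49 - 12.62 * sd - 15.52 * abs_slope
```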
Looking at the error histograms of both headphones, the Stealth has a much smaller standard deviation and is the clear winner on that metric. For the absolute slope, the HD800S is almost flat while the Stealth has a slightly larger slope, which seems to close some of the gap left by the standard deviation score.
When you train a neural network, it produces these weights, and sometimes, if the model is small, you expect those weights to have some sense, a meaning of sorts. They usually don't, yet the model works anyway. This feels a bit like that to me: the deviations from the curve are wildly different, but the predicted preferences are very similar.
So maybe preference is not so much about matching the target at every turn. Maybe it is not a good idea to look at deviations from the target curve and claim those deviations point to an objectively lower preference? Because, if my understanding is right, even if a headphone deviates from the target curve, as long as it does so in a balanced way that keeps the slope of the regression line flat, its predicted preference might not end up very low after all.
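To convince myself, I tried a toy example with made-up error curves (not the actual measurements of either headphone): one curve that wiggles a lot around the target but stays balanced, and one that stays close to the target but is tilted. With the assumed coefficients from the sketch above, they come out within a couple of points of each other:

```python
import numpy as np

freq = np.logspace(np.log10(50), np.log10(10_000), 200)   # 50 Hz to 10 kHz
log_f = np.log10(freq)
t = (log_f - log_f[0]) / (log_f[-1] - log_f[0])            # 0..1 across the band

error_a = 3.0 * np.cos(6 * np.pi * t)        # big but balanced wiggles (3 full cycles)
error_b = 1.2 * (log_f - log_f.mean())       # small error, but tilted

for name, err in [("A: wiggly but balanced", error_a), ("B: close but tilted", error_b)]:
    sd = np.std(err)
    abs_slope = abs(np.polyfit(log_f, err, 1)[0])
    score = 114.49 - 12.62 * sd - 15.52 * abs_slope        # assumed coefficients, see above
    print(f"{name}: SD={sd:.2f} dB  |slope|={abs_slope:.2f}  score~{score:.1f}")
```

With these particular curves, the wiggly-but-balanced one actually scores slightly higher than the close-but-tilted one, which is exactly the point: the size of the deviations alone doesn't decide the predicted score.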