OK, let me first say that I admire your commitment and your work on this, so please don't take my reply as "internet hate".

I just want to warn you about the risks of taking the average of only two data sets. An average is statistically meaningful only when it is taken across a large population, not 2, 3, or even 10 samples. With such small numbers, your result may be severely biased by outliers. Suppose sample A is the more common one and sample B is an outlier: by taking the average, you end up with a result that suggests the typical case lies in the middle, while in fact it's simply A.
I think the discrepancy you found is very interesting and definitely deserves further investigation. Maybe tester B is doing something wrong (I'm talking about a generic tester). Maybe there are two batches of the same product. Imagine that: people who have batch A own a pair of headphones that is perfectly served by curve A, and for them an average of A and B would just be suboptimal.
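
To make the outlier point concrete, here is a rough Python sketch. All the numbers are made up for illustration (they are not taken from your measurements), and the variable names are just my own labels: I assume five readings that behave like sample A and a single reading that behaves like sample B.

```python
from statistics import mean, median

# Hypothetical readings in dB at a single frequency; every value is invented.
typical = [3.0, 3.1, 2.9, 3.0, 3.2]  # units that behave like sample A
outlier = 6.0                         # one reading that behaves like sample B

# Averaging just one A reading and the B reading lands halfway between them.
two_sample_avg = mean([typical[0], outlier])

# With more readings, the mean is pulled less by the single outlier,
# and the median stays close to the typical values.
many_sample_avg = mean(typical + [outlier])
many_sample_median = median(typical + [outlier])

print(two_sample_avg)
print(many_sample_avg)
print(many_sample_median)
```

With only two samples there is no way to tell which of the two is the odd one out; with more of them, the picture usually becomes clearer.
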
So by all means, keep experimenting and asking questions of the people involved in this. Mine is just a (hopefully constructive) remark on the method.