While I do think it is commendable that you repeated the test with level matching, it is IMO much too early to jump to any conclusions. TBH, repeating your claims after each round of informal tests makes it seem like your mind is already set on the desired outcome, even before attempting any test. This is suggestive of confirmation bias, which makes it more difficult for you to run a reliable test in the first place, and unfortunately also for others to have confidence in the results you present.
Please don't take this personally - we all sometimes fall prey to confirmation bias - that comes with being human.

However, it is why we have to work hard to eliminate its effects from our investigations. It is also prudent to be reasonably skeptical of one's own work - especially when it is producing unexpected results.
Lastly, I hope you realize that sighted listening is a serious problem. You may notice that not all test variants I proposed in post #4 are level-matched, but all are blind. In addition, multiple randomized trials would be required to reach statistical significance. It can be tedious work.
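To give a rough idea of what "statistical significance" means for a series of blind trials, here is a minimal sketch. It assumes an ABX-style test where each trial is an independent right/wrong answer and pure guessing has a 50% chance of being correct. The function names and the 0.05 threshold are just my illustrative choices, not part of any standard protocol.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """One-sided probability of scoring at least `correct` out of `trials`
    by pure guessing (assumed chance level: 0.5 per trial)."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    # Sum the binomial tail: P(X >= correct) with p = 0.5
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials


def min_correct_for_significance(trials: int, alpha: float = 0.05):
    """Smallest score whose guessing probability is <= alpha,
    or None if even a perfect score cannot reach it."""
    for correct in range(trials + 1):
        if abx_p_value(correct, trials) <= alpha:
            return correct
    return None


if __name__ == "__main__":
    for n in (4, 10, 16):
        print(n, "trials ->", min_correct_for_significance(n), "correct needed")
```

Under these assumptions, 12 correct out of 16 gives p of roughly 0.038, while 8 out of 10 (about 0.055) falls just short of 0.05, and 4 trials can never get there even with a perfect score. That is why a handful of trials is not enough.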
Honest question - if your motivation is to learn and understand, why offer money / make a challenge out of it at all?
If you're honestly interested in this line of investigation (as you seem to be, given the time invested so far), and believe you have solid findings that corroborate your point of view, why not instead spend the time, money and effort to learn more about controlled listening test methodology yourself, build a solid controlled test protocol, and try to present your findings somewhere for formal peer review?
Do you have any explanation for the pretty severe HF roll-off seen in the Topping stack FR (-5 dB @ 20 kHz)?
'Hairiness' in the FR plots may also suggest some issue with the setup - perhaps driver-related? In addition to FR, I'd suggest also showing a distortion vs frequency plot.