It's not as difficult as you imagine, certainly with components like amps and DACs. Accurately matching volume levels and simply not being able to see the device(s) under test go a long way to solving the problem. I have done this with many audiophiles, and the differences they could hear when sighted and uncontrolled disappear or become very much reduced.
No skin in the game, but if you wanted to do this with scientific rigor to find the relationship between audible differences and measurements (not to discredit some audiophiles, which is not a very scientific goal):
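For the level-matching part, a minimal sketch of the arithmetic, assuming you measure RMS voltage at the output of each device with the same test tone and load. The function names and the 0.1 dB tolerance are my own assumptions for illustration (0.1 dB is a commonly quoted target, not something stated above).

```python
import math


def level_difference_db(v_a: float, v_b: float) -> float:
    """Level difference in dB between two RMS voltages (A relative to B)."""
    if v_a <= 0 or v_b <= 0:
        raise ValueError("voltages must be positive")
    return 20.0 * math.log10(v_a / v_b)


def is_level_matched(v_a: float, v_b: float, tolerance_db: float = 0.1) -> bool:
    """True if the two devices are within tolerance_db of each other.

    tolerance_db = 0.1 is an assumed target, adjust to taste.
    """
    return abs(level_difference_db(v_a, v_b)) <= tolerance_db
```

If the check fails, trim the gain on one device and measure again before anyone listens.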
1. Have the setup arranged by someone with no skin in the game, to remove any implicit bias in the setup
2. Calibration of the setup and reference sample - Pick two amps with clearly different measurement characteristics and test to make sure that anyone can detect the extremes (otherwise the test is biased towards not hearing a difference). Now move to better and better measurements until the same people cannot differentiate with statistical significance.
2a. Drink beer.
3. Select sample test subjects with no bias towards age/sex/etc
4. To test at what difference in measurements the differences become inaudible in a DBT
4a. Use the calibration amps above and test them pairwise, with one being the top-measuring amp and the other varying. For each pair, in random order, subjects just have to identify which amp is A and which is B with better than coin-toss probability when you switch them between A and B.
4b. Find the threshold at which the median subject can no longer distinguish them
4c. Drink beer.
5. To test whether two amps sound the same or not (i.e., people can reliably hear a difference to pick one out). They don't have to pick which one they like.
5a. Select the reference pair above and pick some similarly measuring amps, all above the differentiating threshold found in Step 4, but across multiple technologies and vendors. Mix the amps pairwise randomly.
5b. Repeat the test above to let people pick which one is A and which one is B when they are randomly switched and listened to blind.
5c. Check that people reliably picked the reference pair, which was chosen for its audible difference. If not, go back to Step 2
5d. Drink beer.
6. Tabulate results for each pair. Find the pairs that people told apart to statistical significance (better than coin toss). If there are none other than the reference pair, the thesis that people don't hear any difference above that threshold of measurement is supported.
If they did differentiate two amps to statistical significance, investigate why both measured the same and whether the difference is explained by some other measurement. If the latter, repeat the test above with the new measurement added.
7. Drink beer
8. Post your results to ASR.
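For the "better than coin toss" part of Steps 4 to 6, here is a rough sketch of how the randomizing and scoring could be done. It assumes each trial is an independent A-or-B guess with a 50% chance of being right by luck, and it uses a one-sided exact binomial test. The function names, the 0.05 cutoff, and the seed are all hypothetical choices of mine, not part of the procedure above.

```python
import random
from math import comb


def make_trial_schedule(n_trials: int, seed=None) -> list:
    """Random sequence of 'A'/'B' saying which amp is actually playing per trial."""
    rng = random.Random(seed)
    return [rng.choice("AB") for _ in range(n_trials)]


def p_value_at_least(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: probability of getting `correct`
    or more right out of `trials` by pure guessing."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


def min_correct_for_significance(trials: int, alpha: float = 0.05):
    """Smallest score that beats guessing at level alpha, or None if even
    a perfect score is not enough (too few trials)."""
    for c in range(trials + 1):
        if p_value_at_least(c, trials) <= alpha:
            return c
    return None


if __name__ == "__main__":
    schedule = make_trial_schedule(16, seed=1)
    answers = list(schedule)  # stand-in for a subject's answers
    score = sum(a == s for a, s in zip(answers, schedule))
    print(score, p_value_at_least(score, len(schedule)))
```

With 16 trials, a subject needs 12 or more correct to get under 0.05 (p is about 0.038 at 12, about 0.105 at 11). Note that if you test many pairs and many subjects, some will pass by luck alone, so the cutoff should be tightened or the passing pairs retested.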
If the above is too much work or impractical, avoid voicing definitive opinions on whether any set of measurements captures all differences or whether two amps sound different despite measurements
OR
Continue with "my opinion vs yours" as usual, with no scientific validity on either side.