Thanks for posting the details. I have no conclusions to advance; I am just processing the information and testing it for validity.
I will assume that he is reporting the results without any bias or manipulation.
The phrase "controlled listening tests" is sometimes overused; whether one is necessary depends on the purpose. If you wanted to draw conclusions from a small sample as to whether two sounds are different, then you would need a controlled listening test for the express reason of eliminating other variables that could explain the difference.
But this is more of a statistical sampling exercise, where the listening conditions are sufficiently randomized over a large sample space that presumably no single variable dominates. In such a scenario, if you tossed a coin and reported the results over a sufficiently large number of attempts, they would fall on a binomial distribution around the middle: roughly as many people would say the sounds are different/worse as would say they are the same. But if there is a skew in that curve towards same or towards different/worse, it is possible to conclude, to statistical significance, that the two are considered the same or considered different/worse.
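To make that concrete, here is a minimal sketch in Python (using scipy's binomtest; the counts are invented for illustration) of testing whether reported responses deviate from the 50/50 coin-toss expectation:

```python
from scipy.stats import binomtest

# Invented illustrative counts: of 200 listeners, 128 reported
# the two sounds as different/worse.
n_listeners = 200
n_different = 128

# Null hypothesis: responses are a pure coin toss, i.e.
# "different/worse" counts follow Binomial(n=200, p=0.5).
result = binomtest(n_different, n=n_listeners, p=0.5, alternative="two-sided")

print(f"observed rate: {n_different / n_listeners:.2f}")
print(f"p-value: {result.pvalue:.4f}")
# A small p-value means the skew away from 50/50 is unlikely to be
# chance, i.e. the responses differ from coin tossing to statistical
# significance.
```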
As a form of empirical science, that would be a valid observation.
If he is able to find a statistically significant positive correlation between Df values and the results that detected a difference then, while it might not be a perfect measure, it may have some validity. Note that in such empirical studies it is not necessary to have a mechanistic explanation for why this is so, or even a theory to explain it.
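As a sketch of what such a correlation check could look like (with hypothetical Df values and detection outcomes, not his data), one could compute a point-biserial correlation between the continuous Df value and the binary detected/not-detected outcome:

```python
from scipy.stats import pointbiserialr

# Hypothetical data: one Df value per unit, and whether listeners
# detected a difference for that unit to statistical significance.
df_values = [-60, -55, -48, -42, -38, -35, -30, -25, -20, -15]
detected  = [0,   0,   0,   0,   1,   0,   1,   1,   1,   1]

# Point-biserial correlation: binary outcome vs continuous variable.
r, p = pointbiserialr(detected, df_values)
print(f"r = {r:.2f}, p = {p:.4f}")
# A significantly positive r would support (but not prove) the claim
# that higher Df values go with audible differences; it says nothing
# about why, which is fine for an empirical study.
```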
So the following isn't necessarily a knock against it, given the limited conclusions being drawn. If he claimed that Df somehow captures a mechanistic explanation of what makes sounds be perceived as different, then his metric would have to be consistent with psychoacoustic analysis. I don't see him making any such claim, so he shouldn't be held to that standard.
It is a valid question whether his measurement itself is robust and repeatable. While he might have done a large sampling of listeners, the number of units he has measured is relatively small, so it would be difficult to make the case that his method is well-defined enough. All he can say is that, for that set of measurements, there was a statistically significant positive correlation between the Df values computed and perceived difference.
But for that conclusion to have validity and confidence, it needs to be tested against a totally different set of samples (preferably by someone else using similar equipment, which is why repeatability is important in the scientific process), with a similarly large sampling of listeners. If it shows the same positive correlation between his thresholds and perceived difference, then the confidence level in the correlation increases and the measure becomes more useful. If the second set showed no such correlation, then his first sampling was a statistical aberration. If it showed a very different cut-off point, then the reliability of that metric to capture "badness" would be in question.
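One conventional way to compare the correlations from the original and replication samples (the numbers below are made up) is Fisher's z-transformation, which tests whether two independent correlation coefficients differ significantly:

```python
import math
from scipy.stats import norm

def fisher_z_diff(r1, n1, r2, n2):
    """Two-sided test of whether two independent correlations differ."""
    z1, z2 = math.atanh(r1), math.atanh(r2)   # Fisher transform of each r
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * (1 - norm.cdf(abs(z)))
    return z, p

# Made-up example: first study r=0.70 over 20 units,
# replication r=0.55 over 25 units.
z, p = fisher_z_diff(0.70, 20, 0.55, 25)
print(f"z = {z:.2f}, p = {p:.3f}")
# A large p means the replication is consistent with the original
# correlation; a small p would put the metric's reliability in doubt.
```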
The final test is one of forecastability. Take another sample, compute the Df values, and, based on the earlier tests, predict which ones would be perceived as different/worse. Now test that hypothesis over a sufficiently large population. If the predictions were supported in a statistically significant way, then it would be a valid metric for that purpose.
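Such a forecastability test could be sketched like this (the threshold and data are hypothetical): fix the Df cut-off from the earlier studies, predict audibility for a fresh sample, then check whether the predictions beat chance:

```python
from scipy.stats import binomtest

# Hypothetical cut-off from the earlier studies: Df above -40 dB
# is predicted to be perceived as different/worse.
THRESHOLD = -40

# Fresh, held-out sample: (Df value, actually perceived as different?)
fresh_sample = [(-55, False), (-47, False), (-41, True), (-38, True),
                (-33, True), (-52, False), (-36, False), (-22, True)]

# Count predictions that matched the listening outcome.
hits = sum((df > THRESHOLD) == perceived for df, perceived in fresh_sample)

# Were the predictions right more often than a coin toss would be?
result = binomtest(hits, n=len(fresh_sample), p=0.5, alternative="greater")
print(f"{hits}/{len(fresh_sample)} correct, p = {result.pvalue:.3f}")
# A statistically significant result would support the Df threshold as
# predictive for this purpose; in practice a much larger sample is needed.
```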
This is a valid concern. It is possible that the values fall so close to each other that arbitrarily drawing a threshold between them may not be statistically justifiable. It could mean that the metric does not have sufficient resolution/granularity to separate goodness from badness. BUT, it could also mean that the units tested were too similar to produce much spread, so the audibility correlation might be an accidental consequence of small numbers (small spread) that would be caught with a totally different sample.
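One way to check whether a threshold drawn through tightly clustered values is justifiable at all is a simple permutation test, sketched below with invented, deliberately clustered numbers: shuffle the audibility labels and see how often a best-case threshold separates the shuffled data as well as it separates the real data:

```python
import random

# Invented, deliberately clustered Df values with audibility labels.
df_values = [-41.2, -40.8, -40.5, -40.3, -40.1, -39.9, -39.6, -39.4]
audible   = [False, False, True, False, True, True, False, True]

def best_accuracy(values, labels):
    """Best classification accuracy over all cut points, either direction."""
    n = len(values)
    best = 0
    for cut in values:
        correct = sum((v > cut) == lab for v, lab in zip(values, labels))
        best = max(best, correct, n - correct)
    return best / n

observed = best_accuracy(df_values, audible)

# Permutation test: how often does a random relabeling do as well?
random.seed(0)
trials = 10_000
as_good = sum(
    best_accuracy(df_values, random.sample(audible, len(audible))) >= observed
    for _ in range(trials)
)
print(f"best accuracy {observed:.2f}, permutation p = {as_good / trials:.3f}")
# A large p-value says the threshold separates the groups no better
# than chance, i.e. the metric lacks resolution on this sample.
```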
At the least, it suggests he needs a more "discriminating" metric that would spread the values out. But then he is constrained by trying to differentiate between units that show very similar SINAD in order to claim that his metric can distinguish them. That may just not be possible with his current metric.
Yes, it could be a diagnostic tool for QC purposes: not a defining metric, but a way to prompt further enquiry if the number were to fall in the badness range, with the caveat that it could be a false positive.
For a metric like that to be used here, it would require the forecastability test I mentioned above. On the other hand, we don't have any such result for SINAD either, other than at the extremes. So we are discussing a known devil vs an unknown devil.