It's interesting how statistics has different flavors depending on the field of application. I was not familiar with those quantities, although that doesn't mean much, as I don't have that much experience.
....
....
The approach of @Semla is more ambitious as it builds a model to predict future scores.
The discussion with @Semla is more about the assumptions that can or cannot be made, rather than about the nature or reliability of the tests themselves. He/she proposes a linear model with dummy variables because it accounts for all the relations between the different factors, which is indeed very reasonable.
Please, correct me if I'm wrong.
Edit: I forgot to answer whether the coefficients you point to can be applied here.
The answer is no, because they are based on real observations after the experiment: measuring all four possible outcomes and calculating certain ratios. That doesn't make sense in this context, as there are no such outcomes when there is no treatment effect. It's a different setting.
Thank you so much for your kind response. I also feel and fully agree with you that "It's interesting how statistics has different flavors depending on the field of appliance." I am just curious about how we can objectively assure or measure the reliability of "the test" we are discussing...
In medical diagnostic R&D, if we could get the results shown in my example diagram above, "the test" should be reasonably reliable because of the sensitivity = 0.917 and the specificity = 0.750 given by the total of 200 cases analyzed. Usually, we analyze more than 1,000 cases, sometimes more than 10,000 cases, for sensitivity and specificity discussions.
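For readers less familiar with these quantities, the arithmetic behind them is simple. A minimal sketch, assuming for illustration a 2x2 table with TP=110, FN=10, TN=60, FP=20 (one split of 200 cases that happens to reproduce sensitivity ≈ 0.917 and specificity = 0.750; the actual counts in the diagram may differ):

```python
# Sensitivity and specificity from a 2x2 confusion matrix.
# Counts are illustrative assumptions, chosen to match the figures above.
tp, fn = 110, 10   # diseased cases: correctly detected / missed
tn, fp = 60, 20    # healthy cases: correctly cleared / false alarms

sensitivity = tp / (tp + fn)   # P(test positive | disease present)
specificity = tn / (tn + fp)   # P(test negative | disease absent)

print(f"sensitivity = {sensitivity:.3f}")  # 0.917
print(f"specificity = {specificity:.3f}")  # 0.750
```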
For the diagnosis of very rare diseases (or disorders, abnormalities), however, we inevitably encounter a limitation on the number of sample/test cases, e.g. fewer than 200 cases. Even in that kind of limited situation, if sensitivity/specificity exceed 0.75 (or 0.80), we may assume "the test" is relatively reliable. "The new test", which has significantly better cost effectiveness than the very expensive "gold standard" procedures, would consequently be accepted (approved) for routine clinical use as a screening test ahead of the "gold standard" test. In this situation, we should take care to forward as many potentially "false negative" cases (i.e. highly disease-suspicious cases, e.g. for cancer) as possible into the expensive "gold standard" test procedure(s). In other words, in disease diagnosis, false positive cases are rather acceptable (they simply go on to further expensive precision diagnosis), but false negative cases (like dismissing a patient who actually has the disease) must be avoided as far as possible.
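One simple, objective way to express how much a small sample limits confidence in an estimated sensitivity (or specificity) is a binomial confidence interval around the observed proportion. A sketch using the Wilson score interval, with illustrative counts (the same point estimate from 120 cases versus only 12 cases):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Same point estimate (sensitivity ~ 0.917), very different certainty:
print(wilson_interval(110, 120))  # ~ (0.853, 0.954) with 120 diseased cases
print(wilson_interval(11, 12))    # ~ (0.646, 0.985) with only 12 cases
```

With only a dozen positive cases, the interval is so wide that a "sensitivity of 0.917" is compatible with a true sensitivity well below 0.75, which is exactly the small-sample concern raised here.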
I know that comparative audio listening tests also always face a limitation on sample/case numbers for statistical analyses. I would like to have/know a simple, easy-to-understand, objective reliability measure ("Merkmal") for such comparative audio listening tests with rather limited sample numbers. I assume your and @Semla's discussions in this thread would almost sufficiently answer my concerns.