My work is in the design, analysis and interpretation of clinical trials in some very specific areas. I would like to add a few comments to give perspective to what
@amirm states regarding "training" and statistics.
First, on statistics. I do have to caution that the hypothesis you test makes a difference. Believing two things are different and testing for a difference is not the same as assuming they are the same and trying to show you cannot tell them apart (an equivalence test); failing to find a difference is not proof of sameness. It is a rather subtle concept, and the two approaches require very different sample sizes. Not a serious issue for this group, but you should have some caution when using "statistics".
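To make the sample-size point concrete, here is a rough back-of-the-envelope sketch in Python. The numbers and functions are purely illustrative (standard normal-approximation formulas, not from any particular trial), just to show that asking "is there a difference?" and "are they the same within a margin?" lead to different required sample sizes even for the same 0.5-SD quantity.

```python
# Illustrative only: normal-approximation sample-size formulas for a
# two-sided difference (superiority) test vs. an equivalence (TOST) test.
from scipy.stats import norm

def n_per_group_difference(delta, alpha=0.05, power=0.80):
    """Two-sided test to detect a standardized difference `delta`."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return 2 * (z_a + z_b) ** 2 / delta ** 2

def n_per_group_equivalence(margin, alpha=0.05, power=0.80):
    """Two one-sided tests (TOST) to show the true difference lies within
    +/- `margin` (standardized), assuming the true difference is zero."""
    z_a = norm.ppf(1 - alpha)            # each one-sided test at level alpha
    z_b = norm.ppf(1 - (1 - power) / 2)
    return 2 * (z_a + z_b) ** 2 / margin ** 2

# Same 0.5-SD quantity, two different questions, two different sample sizes:
print(round(n_per_group_difference(0.5)))   # roughly 63 per group
print(round(n_per_group_equivalence(0.5)))  # roughly 69 per group
```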
The second has to do with the training of "testers". I will give a few examples. In rheumatoid arthritis you need to measure joint pain and inflammation in patients. But this is a very subjective assessment, both for the patient and for the observer making it. In trials, you have an assessor who is blinded to the treatment, different from the clinician following up on the patient. You also train all the assessors BEFORE the study starts so that they all reach the same inflammation and tenderness (pain) score for each patient and joint. You bring in a large group of patients and every assessor sees every patient. Then the head trialist explains how to make sure they reach the same score. Training is intense because the "joint count" is a key measure of efficacy in this disease.
The second example is from the "reading" of colonoscopies in studies of Crohn's disease. Even though there is a specific protocol on how to read and grade them, doctors are still unreliable in trials. Most old studies had "local" readings, and the doctors had an incentive to enroll patients, who needed a minimum score on the scope to qualify. This resulted in high placebo responses on colonoscopy in trials. As this was not biologically plausible, a friend of mine took one trial and re-read every scope blindly, not knowing whether the scope was from before or after treatment. What happened was that the blinded scores at entry were much lower than the "local" readings, but the scores at the end did not change much. In currently designed trials, you have a central blinded reader to enroll patients (to eliminate the local trialist's incentive), and then the scopes are read at the end of the study without knowing their order. And these biases happen in a very sophisticated environment with highly trained professionals. So training and limiting subjectivity are essential at every level.
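To see why inflated entry readings alone can look like a placebo effect, here is a toy simulation with entirely made-up numbers (not from any real trial), just to show the mechanism: if local readers nudge entry scores up to qualify patients while the exit scopes are read honestly, the placebo arm appears to improve even when nothing changed.

```python
# Hypothetical illustration: inflated local entry readings create an
# apparent placebo "improvement" that vanishes on blinded re-reading.
import numpy as np

rng = np.random.default_rng(0)
n = 200

true_score    = rng.normal(6.0, 2.0, n)                    # underlying disease activity
local_entry   = true_score + np.abs(rng.normal(1.0, 0.5, n))  # biased upward to qualify
blinded_entry = true_score + rng.normal(0, 0.5, n)          # unbiased re-read
exit_score    = true_score + rng.normal(0, 0.5, n)          # placebo arm: no real change

print("apparent improvement (local entry):  ",
      round((local_entry - exit_score).mean(), 2))   # about 1 point of "response"
print("apparent improvement (blinded entry):",
      round((blinded_entry - exit_score).mean(), 2)) # about 0, as biology suggests
```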
My last example is from a virology study. There, the doctors and patients were open to treatment (not blinded), but we only took patients who met specific entry criteria and compared the lab results, which were effectively blinded (the machine had no "idea" whose sample it was), so the virology results were valid. You can't "fix" the lab (except in sports, but that is a digression).
The point is that in many scientific endeavors, the very intense training of assessors or participants is essential to getting reliable results. What
@amirm explained is the right way to do this work.