So did he test to see if people could tell the difference before asking for preferences?
No, but I don't think that's necessary in this case, because the test is done blind and a statistical significance test is run on the preference rankings. If people couldn't tell the difference, we would see random preferences with no clear pattern and a high p-value, and that's indeed what we see (except for the motherboard audio). IANAS (I Am Not A Statistician), but the approach seems fair to me.
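To make that concrete, here's a minimal sketch (my own illustration, not Archimago's actual analysis) of why a significance test on the rankings doubles as a discrimination check: under the null hypothesis that listeners can't tell the devices apart, each respondent's ranking is effectively random, and something like a Friedman test on the per-device ranks returns a high p-value. The sample size and the use of a Friedman test here are assumptions for the example.

```python
# Sketch: if listeners can't discriminate, rankings are random and p stays high.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
n_respondents, n_devices = 100, 4  # hypothetical numbers

# Simulate the null hypothesis: each respondent ranks the four files at random (1 = best).
rankings = np.array([rng.permutation(n_devices) + 1 for _ in range(n_respondents)])

# Friedman test across devices: do any devices get systematically better ranks?
stat, p = friedmanchisquare(*(rankings[:, d] for d in range(n_devices)))
print(f"Friedman chi-square = {stat:.2f}, p = {p:.3f}")  # p should be large under the null
```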
To me, the biggest problem, from a study design perspective, is this:
Isn't that interesting? For each device, the largest number of "votes" was in the order of presentation! Many respondents, specifically the ones who thought there was "no noticeable difference" simply voted "A-B-C-D" to create this pattern. In fact, for those who thought there was "no noticeable difference", the "A-B-C-D" pattern of response from best to worst accounted for almost 80% of those votes. This is basically "noise" that needs to be filtered out if we are to hopefully understand the true preferences of those who felt they could hear a difference.
I don't know how Archimago "filtered out" that noise. It would have been better to design a protocol where the presentation order of the files is randomized for each respondent, so we don't have this problem in the first place.
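For what it's worth, here's a rough sketch of what that per-respondent randomization could look like (hypothetical file names and helper functions, nothing from the original test): every listener gets the same four files, but the A/B/C/D labels are shuffled independently for each respondent, so a lazy "A-B-C-D" vote from someone who hears no difference spreads evenly across the real devices instead of piling up on one presentation order.

```python
# Sketch of per-respondent label randomization (hypothetical protocol code).
import random

FILES = ["device_1.flac", "device_2.flac", "device_3.flac", "device_4.flac"]  # hypothetical names

def assign_labels(respondent_id: int) -> dict[str, str]:
    """Return a private label -> file mapping for one respondent."""
    rng = random.Random(respondent_id)  # reproducible shuffle per respondent
    shuffled = FILES[:]
    rng.shuffle(shuffled)
    return dict(zip("ABCD", shuffled))

def decode_vote(respondent_id: int, vote: list[str]) -> list[str]:
    """Translate a best-to-worst vote in label space back to the real files."""
    mapping = assign_labels(respondent_id)
    return [mapping[label] for label in vote]

# Example: a "no difference" respondent who just votes A-B-C-D
print(decode_vote(17, ["A", "B", "C", "D"]))
```

With this kind of design, the lazy "A-B-C-D" votes become uniform noise across devices rather than a spike on the order of presentation, so they don't need to be filtered out after the fact.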