Their preferred EQ likely depends on volume, so if you don't control that variable you are comparing apples to oranges. I would only allow them to adjust volume if it is treated as a measured variable. So maybe they adjust the EQ to preference at standardized low, medium, and high volumes. It would also be interesting to know what their preferred volume is, although that data exists somewhere.
Related to controlling for volume: in this exchange with @Floyd Toole I discussed the difficulties of loudness normalization in blind tone control / EQ preference tests. What specific loudness normalization procedure would you say is best for blind testing, say, two EQ profiles on the same headphone, e.g. one with Harman-level bass and another with zero bass 'boost'? And is this normalization really necessary, considering your findings in this paper, which (if I understood correctly) seem to suggest that loudness normalization during MOA tone control tests had no statistically significant effect on preferred bass shelf level and starting frequency (at least for IEMs)?
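To make the question concrete, here is a minimal sketch of the simplest candidate I can think of, broadband RMS matching. It is an illustration only, not the procedure from the paper or from the linked exchange. The function names and the test signals (a 1 kHz tone, and the same tone with an added 60 Hz component standing in for a bass-boosted profile) are hypothetical. RMS weights all frequencies equally, whereas perceived loudness is frequency dependent, which is exactly why I'm unsure a match like this is the right one.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def match_rms(reference, candidate):
    """Scale `candidate` so its broadband RMS equals that of `reference`.

    Returns (scaled_samples, gain_db). This is plain level matching,
    not a perceptual loudness model.
    """
    gain = rms(reference) / rms(candidate)
    return [s * gain for s in candidate], 20.0 * math.log10(gain)

# Hypothetical stimuli, one second at an assumed 8 kHz sample rate.
fs = 8000
# "Flat" profile: a 1 kHz tone only.
flat = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(fs)]
# "Bass-boosted" stand-in: the same tone plus a louder 60 Hz component.
bassy = [
    math.sin(2 * math.pi * 1000 * n / fs)
    + 2.0 * math.sin(2 * math.pi * 60 * n / fs)
    for n in range(fs)
]

# Attenuate the bass-heavy version until both have the same RMS.
bassy_matched, gain_db = match_rms(flat, bassy)
print(f"gain applied to bass-boosted version: {gain_db:.2f} dB")
```

Note what happens after matching: the whole bass-boosted signal is turned down, so its 1 kHz content is now quieter than in the flat version. Whether that is a fairer comparison than leaving the midrange level untouched is the part I'd like an opinion on.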