Here's the problem. The equal loudness contours are a smoothed average. You have no idea what the fine-grain detail of your specific equal loudness contour looks like, and so no way to differentiate dips in it from peaks in a headphone's frequency response when listening to sine sweeps.
So has Oratory, but he still doesn't know exactly what they should actually sound like to his particular ears (and it's certainly not equal loudness across the sweep, it would be different for each person):
So, I see what you're saying, but are individual contours really known to vary like that?
Anyway, I think I am using tones in a way that is probably not as objectionable as you think. I don't really know or claim to know what "a sweep" should sound like. I have a loose idea of what a sweep sounds like on a decent set of cans vs. a terrible set, but I don't know if I could easily differentiate very good vs. excellent ones that way.
In practice the workflow might be like this, if I am not using measurements:
1. Listen to some music to get a general impression
2. Run a sweep to find obvious problems
3. EQ a bit to remove obvious problems
4. Listen to some music again, take notes on potential problem areas (generally the same 15 or so reference tracks over and over)
5. Manually sweep tones in those areas (typically comparing less than 1 octave at a time) to further investigate problems
6. EQ a bit more in those areas
7. Listen to more music
8. Repeat as needed
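For anyone who wants to try the narrow-range part of that, here is a rough sketch of how a sub-octave sweep could be generated. This is not the tool I actually use; the function names (`log_sweep`, `write_wav`), the 0.25 amplitude, and the 2 kHz to 3.5 kHz example range are all just assumptions for illustration. There is no fade in or out, so expect a small click at the start and end.

```python
import math
import struct
import wave


def log_sweep(f_start, f_end, duration_s, sample_rate=48000, amplitude=0.25):
    """Return float samples of a logarithmic sine sweep from f_start to f_end (Hz).

    Hypothetical helper: amplitude is a fraction of full scale and is kept
    low on purpose (assumed value) to protect ears and drivers.
    """
    if f_start <= 0 or f_end <= 0 or f_start == f_end:
        raise ValueError("frequencies must be positive and different")
    n = int(duration_s * sample_rate)
    log_ratio = math.log(f_end / f_start)
    # Phase of an exponential sweep: integral of 2*pi*f(t) with
    # f(t) = f_start * (f_end / f_start) ** (t / duration_s)
    k = 2.0 * math.pi * f_start * duration_s / log_ratio
    samples = []
    for i in range(n):
        t = i / sample_rate
        phase = k * (math.exp(log_ratio * t / duration_s) - 1.0)
        samples.append(amplitude * math.sin(phase))
    return samples


def write_wav(path, samples, sample_rate=48000):
    """Write the samples as a mono 16-bit PCM WAV file."""
    frames = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


# Example (illustrative range, well under one octave):
# write_wav("sweep_2k_3k5.wav", log_sweep(2000.0, 3500.0, 8.0))
```

A logarithmic sweep spends equal time per fraction of an octave, which seems like the natural choice when comparing neighboring regions by ear, but a linear sweep would work too over a range this small.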
I definitely don't sit there listening to 20 Hz to 20 kHz REW sweeps and computing FIR curves in my head. What I do is not far off from what Oratory said was potentially acceptable: listening to tones in small ranges to iron out undesirable peaks and valleys. I think it is even a bit similar to what Amir does when he reviews a headphone; I just like using tones more than music or pink noise for some reason.
But philosophically speaking, I still don't 100% grasp why a user shouldn't EQ to subjectively flat using tones, supposing they can successfully compare 40 to 80 to 200 to 400 to 3000 Hz and beyond (which I agree is pretty hard if not impossible). Wouldn't that just result in something like a personalized Harman curve, which is also just a smoothed population-based average like Fletcher-Munson?