Finally watched some of that.
As I understand what he said, the "highest rated" is one he equalized to his specification and then measured; the rest, I suppose, were measured without EQ. Some came closer than others.
Later, 15 'phone measurements from an earlier test:
"The blue curve is what people prefer"