This is the way the concept of "twice as loud" works. I don't have the exact history in mind, but this is almost certainly a reasonable facsimile:
A couple of people or groups discuss the concept.
One person or group argues that loudness is like beauty, and the idea of quantifying twice as much is absurd. Beauty and loudness are in the eye and ear of the beholder. It is completely subjective.
Another person or group argues... well, we all have a good sense of "twice" and we can estimate "twice" other measurable quantities, such as distance or volume (e.g. of liquid), pretty well. Loudness is more like volume (pun intended) than beauty.
They argue back and forth (x N), but words, logic, examples and analogies lead nowhere. Finally someone decides to test it. They gather a large group of people and, using an accepted method (back then, probably the method of adjustment against a calibrated reference sound; less used now), determine what each person, tested alone, estimates to be "twice as loud".
If the results yield a broad, uniform distribution (a similar number of people between 4-6 dB as between 9-11 dB), then it's more like beauty, and nonsensical to measure and report. If, on the other hand, the results give a narrow Gaussian (bell-curve) distribution, where most people land on a similar value and only a very small number are outliers, then the concept makes sense.
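To make the narrow-versus-broad test concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the data are simulated rather than measured, the bell curve is centered on 10 dB only as a placeholder, and the function names, the 2 dB window and the 80% cutoff are hypothetical choices, not part of any standard procedure.

```python
import random
import statistics


def fraction_near_median(estimates_db, window_db=2.0):
    """Share of listeners whose estimate lies within +/- window_db of the median."""
    median = statistics.median(estimates_db)
    near = [x for x in estimates_db if abs(x - median) <= window_db]
    return len(near) / len(estimates_db)


def looks_like_consensus(estimates_db, window_db=2.0, min_share=0.8):
    """Hypothetical rule: call it a consensus if most listeners cluster near the median."""
    return fraction_near_median(estimates_db, window_db) >= min_share


rng = random.Random(0)  # fixed seed so the simulated panels are repeatable

# Simulated panels of 500 listeners each (made-up numbers, not real measurements).
narrow_panel = [rng.gauss(10.0, 1.0) for _ in range(500)]   # bell curve, most people agree
broad_panel = [rng.uniform(3.0, 20.0) for _ in range(500)]  # flat spread, no agreement

for name, panel in [("narrow", narrow_panel), ("broad", broad_panel)]:
    print(
        name,
        "median dB:", round(statistics.median(panel), 1),
        "stdev dB:", round(statistics.stdev(panel), 1),
        "share near median:", round(fraction_near_median(panel), 2),
        "consensus:", looks_like_consensus(panel),
    )
```

With the narrow panel nearly everyone falls inside the window around the median, so the "twice as loud" figure is worth reporting; with the broad panel only a minority does, which is the "eye of the beholder" case.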
The concept makes sense. I don't have the exact history ready for citation, but the idea works for various measurable quantities that we estimate from sensory input. It would also (probably*) work with temperature (warmth/coldness), IF set up properly. Proper design* avoids getting stuck on °F, °C or K.
* willing to discuss further, if needed.