For example, I can easily hear that a speaker is "dynamic" in that its transients have attack and sound clean, but what measurement does this correlate to? I am VERY confident that this is a real subjective phenomenon. I thought it might have something to do with the sharpness of the step response, but I was wrong. I have since found that you need to measure loudspeaker compression and plot the input against the output - e.g. for every 5 dB increase in volume my DAC outputs, I should measure a corresponding 5 dB increase in loudspeaker volume. At some point the loudspeaker will compress and distort. At what volume this happens and by how much is part of the explanation. Other measurements which might explain the subjective impression are early reflections in the ETC, ringing which might smear the transient, and so on.
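To make the input-versus-output idea concrete, here is a minimal sketch in Python. It assumes you already have a list of drive levels and one SPL reading per level taken at a fixed mic position. The function names, the 1 dB onset threshold, and the readings are all hypothetical, invented for illustration, and are not measurements of any real speaker.

```python
def compression_db(drive_db, measured_spl_db, ref_index=0):
    """Return, per drive level, how far measured SPL falls short of linear tracking.

    0 dB means the speaker tracked the input exactly; positive values
    mean the output rose less than the input did (compression).
    """
    if len(drive_db) != len(measured_spl_db):
        raise ValueError("need one SPL reading per drive level")
    ref_drive = drive_db[ref_index]
    ref_spl = measured_spl_db[ref_index]
    result = []
    for drive, spl in zip(drive_db, measured_spl_db):
        expected = ref_spl + (drive - ref_drive)  # ideal: 1 dB out per 1 dB in
        result.append(expected - spl)
    return result


def onset_level(drive_db, compression, threshold_db=1.0):
    """First drive level at which compression reaches the threshold, or None."""
    for drive, comp in zip(drive_db, compression):
        if comp >= threshold_db:
            return drive
    return None


# Hypothetical readings, made up for illustration only.
drive = [-20, -15, -10, -5, 0]         # DAC level in dB relative to full scale
spl = [80.0, 85.0, 89.8, 94.0, 97.5]   # measured SPL in dB at a fixed mic position

comp = compression_db(drive, spl)
print([round(c, 1) for c in comp])     # compression in dB at each drive level
print(onset_level(drive, comp))        # drive level where compression reaches 1 dB
```

The lowest drive level is used as the reference on the assumption that the speaker is still linear there. Repeating this per frequency band, rather than on broadband SPL, would show whether compression also changes the frequency response.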
For this specific phenomenon, I have many doubts. I wonder how much I am personally predisposed to equating dynamics with horn speakers. I haven't had a chance to do side-by-side comparisons, so I am wary of my memory.
In other words, I don't know if I could easily identify a speaker as dynamic under blind conditions.
We know horns, generally, (1) have narrow directivity, (2) are less likely to compress because of their sensitivity and efficiency, and (3) in many designs have issues like uneven frequency response and directivity, because of the difficulty of horn design or, more practically, size constraints. Many systems use a horn only for the tweeter, with the midrange and bass commonly handled by dynamic drivers.
We also know, generally, how hearing works: (a) below 1.5kHz, phase locking, (b) around 1.5kHz, a transition region, (c) above 1.5kHz, direct sound dominated.
We further know that room geometry and materials (i) determine modal effects in the bass region, (ii) cause significant interference effects in the transition region, and (iii) have limited relevance for direct sound in the statistical region but are very significant for reflections.
So I would always try to look for the answer for what I heard in 1, 2, 3 and a, b, c and i, ii, iii. Theorizing is necessary because we don't have all the data. Most listening prevents that, and most listening is uncontrolled. That doesn't prevent us from trying to understand the phenomenon (i.e., the "subjective experience"), but it should prevent us from jumping to conclusions (let's call that "uncontrolled theorizing", as a correlate to "uncontrolled listening").
In your example, I would agree that compression behavior is an important aspect of understanding perceived dynamics because, beyond level, it has a strong impact on frequency response. Your approach to measurement is reasonable and has many variants (e.g., Meyer Sound is proposing M-noise as a signal with a known crest factor, so the measurement would take into account the impact of RMS vs. peak levels at different SPLs).
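For anyone unfamiliar with the term, crest factor is the ratio of a signal's peak level to its RMS level, usually given in dB. Here is a minimal Python sketch of that calculation on two synthetic signals. The function name and test signals are my own illustration and say nothing about how M-noise itself is generated or specified.

```python
import math


def crest_factor_db(samples):
    """Peak-to-RMS ratio of a signal, in dB."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        raise ValueError("signal is silent, crest factor is undefined")
    return 20 * math.log10(peak / rms)


# Two synthetic test signals: 10 full cycles over 1000 samples.
n = 1000
sine = [math.sin(2 * math.pi * 10 * i / n) for i in range(n)]
square = [1.0 if s >= 0 else -1.0 for s in sine]

print(round(crest_factor_db(sine), 2))    # sine: peak is sqrt(2) times RMS
print(round(crest_factor_db(square), 2))  # square: peak equals RMS
```

The point for compression testing is that two signals with the same RMS level can demand very different peak output, so a speaker can pass a level sweep with one signal and still run out of headroom on peaks with another.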
I would disagree with respect to ETC, because ETC is based on a linear calculation that puts too much weight on higher frequencies (or, putting it differently, it is most useful for evaluating high-frequency decay). If you want to evaluate this for yourself without a lot of trouble, look at waveforms in a music player or DAW. Those are also based on linear frequency weights; hi-hats and snares have no right showing up the way they do given how much relative energy they contain. "Smearing" occurs in rooms in and around the transition region, which is fairly low down in frequency.
Long way of me saying that the most likely answer for understanding "dynamics" has to do with the frequency response of the direct sound primarily, speaker directivity secondly, and compression as a less probable third.
I think if there is anything I value at ASR, it is "controlled theorizing".