But wouldn't directivity affect this direct sound field much more than the total sound power?
I'm sorry, but I just don't see what is the benefit of this approach. I guess having a narrow directivity can have some benefits, but so far I don't see any data supporting this assertion.
You have misunderstood me there. Below the critical distance, the direct/free field is dominant. In an ideal free field there would be no reflections and the sound character of the loudspeaker would be determined solely by the listening axis - e.g. the on-axis FR when the loudspeakers are directed towards the listener.
In this case (ideal free field) the directivity of the LS would not matter at all; only the FR along the listening axis would play a role.
In an ideal diffuse field, the sound source direction could no longer be determined, since all reflections would be equivalent. Since this can never be realized in a normal listening room, there will always be a mixture of direct and diffuse field at the listening position, with the diffuse field being dominant over wide frequency ranges (see also the calculation of the critical distance).
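To put a number on the critical distance mentioned above, here is a minimal Python sketch. It assumes the common Sabine-based approximation d_c ≈ 0.057·sqrt(Q·V/RT60) in metric units. The function name and the room values (50 m³, RT60 of 0.4 s) are made-up illustrations, not measurements from this thread.

```python
import math

def critical_distance_m(volume_m3, rt60_s, q=1.0):
    """Approximate critical distance in metres.

    Assumes the Sabine-based rule of thumb d_c ~= 0.057 * sqrt(Q * V / RT60),
    with V in cubic metres, RT60 in seconds and Q the directivity factor
    (Q = 1 for an omnidirectional source).
    """
    if volume_m3 <= 0 or rt60_s <= 0 or q <= 0:
        raise ValueError("volume, RT60 and Q must be positive")
    return 0.057 * math.sqrt(q * volume_m3 / rt60_s)

# Hypothetical living room: 50 m^3, RT60 = 0.4 s
print(round(critical_distance_m(50, 0.4), 2))         # omnidirectional source, about 0.64 m
print(round(critical_distance_m(50, 0.4, q=4.0), 2))  # Q = 4, about 1.27 m
```

With these example values the critical distance is well under a typical listening distance of 2 to 3 m, even for a fairly directive source, which is why the reflected sound cannot be ignored at the listening position.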
So you need a measure that describes the behavior and the sound character of a loudspeaker at the listening position including the room reflections.
This is where "sound power" SP and "predicted in-room amplitude response" PIR come into play. Both curves include the radiation off the listening axis, not just the on-axis FR of the speaker.
Historically, the SP curve has been used in loudspeaker design for a very long time. I come from the DIY field, and there the SP (calculated from measurements ±90° horizontally and, not always, ±90° vertically) has been used for over 15 years.
The PIR reproduces the behavior of the LS at the listening position even better. The SP enters the PIR with a weight of 44% (see the CTA-2034-A definitions).
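To make the 44% weighting concrete, here is a small Python sketch of how a PIR curve could be computed from the other spinorama curves. It assumes the CTA-2034-A weighting as I understand it (12% listening window, 44% early reflections, 44% sound power) and that the averaging is done on squared pressures rather than directly on dB values. The function name and the sample curves are hypothetical.

```python
import math

# Assumed weights for the estimated in-room response (my reading of CTA-2034-A)
W_LW, W_ER, W_SP = 0.12, 0.44, 0.44

def predicted_in_room_db(lw_db, er_db, sp_db):
    """Combine LW, ER and SP curves (dB per frequency bin) into a PIR curve.

    Each level is converted to a squared-pressure (energy) value, the
    weighted sum is formed, and the result is converted back to dB.
    """
    if not (len(lw_db) == len(er_db) == len(sp_db)):
        raise ValueError("all curves must have the same number of bins")
    pir = []
    for lw, er, sp in zip(lw_db, er_db, sp_db):
        energy = (W_LW * 10 ** (lw / 10)
                  + W_ER * 10 ** (er / 10)
                  + W_SP * 10 ** (sp / 10))
        pir.append(10 * math.log10(energy))
    return pir

# Made-up three-bin example: the SP falls with frequency, the LW stays flat
lw = [80.0, 80.0, 80.0]
er = [80.0, 79.0, 78.0]
sp = [80.0, 77.0, 74.0]
print([round(v, 1) for v in predicted_in_room_db(lw, er, sp)])
```

Because ER and SP together carry 88% of the weight, a bump or dip in the off-axis radiation shows up in the PIR even when the on-axis FR is flat.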
Why are SP and PIR important?
They help to explain why, for example, Amir found the Wilson Audio TuneTot tonally not bad, and with a little EQ even good.

Especially with the PIR, you can see that the LS does not show sudden or wide humps (when we ignore the ugly bass hump).
SP/PIR can help explain why even poorly designed speakers can still sound good in the listening room if they are tuned properly with SP or PIR in mind.
The better a loudspeaker is designed, the more the on-axis, listening window LW, SP and PIR curves converge:
For a loudspeaker that is nearly SOTA, the various curves are nearly parallel (ignore FR >15kHz) because the radiation is so uniform that the off-axis FRs hardly change. You can then focus on small details, for example using different XO designs to influence the vertical radiation around the crossover frequency. Same LS with two different XO:

In both cases on-axis FR (black), LW (green), SP (blue) and PIR (orange) >700Hz are nearly parallel (DI is red).
However, since very few loudspeakers are SOTA, it is important to look at SP/PIR, or to consider these curves when designing and tuning a loudspeaker, and to value a smooth SP/PIR response more highly than a flat on-axis frequency response. At least that is probably what Stephen Entwistle wanted to say, and I strongly agree with it (I am more skeptical about other statements in the interview).
It is important that a uniform SP/PIR response is present regardless of wide or narrow directivity.
With a narrow directivity it is usually easier to achieve a uniform SP/PIR response, and the curve then falls more steeply towards high frequencies.
Therefore horn loudspeakers can often be corrected quite well via EQ (if the directivity is even, which the DI shows us). Example without and with EQ:
That was another annoyingly long post, but it tries to show that not every LS with a (slightly) wavy FR has to sound bad.
Conversely, you can say that a classic 2-way LS (6.5'' woofer and 1'' tweeter) in a normal cabinet with classic XO around 2kHz with an optimally flat on-axis FR will not sound optimal in a normal listening room (too aggressive at high SPL).