Any ballpark figures or cases you recall for the speed differences? I can't fathom anything these days being THAT critical. But then again, what would I know of mission- or speed-critical applications? I've spoken with architects of high-frequency trading software, and I don't recall even them mentioning looking at assembly, let alone using it. What sort of software would necessitate assembly in the modern day?
For the critical parts, I have seen speedups in excess of 10x, with 4x being more typical. This was before compilers could vectorise loops automatically. They've got a lot better at that lately, so the gains from hand-optimising are smaller now, but even the best compilers often miss a trick or two, and turning to assembly can still be worthwhile. Although vector code is the obvious target, linear code can also benefit when the compiler stumbles. Modern CPUs with out-of-order execution have, however, reduced the need for optimising these cases by hand.
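To make "vector code" a bit more concrete, here is a minimal sketch in C of the kind of loop being discussed: a saturating add over two 8-bit buffers. The function name `add_sat_u8` is made up for illustration and does not come from any particular codebase. Whether a given compiler turns this into SIMD instructions depends on the compiler, its version, and the optimisation flags, so treat that as an assumption to check in the generated assembly rather than a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: dst[i] = min(a[i] + b[i], 255).
 * `restrict` promises the compiler that the buffers do not overlap,
 * which removes one common obstacle to automatic vectorisation. */
void add_sat_u8(uint8_t *restrict dst, const uint8_t *restrict a,
                const uint8_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Widen before adding so the sum cannot wrap around. */
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Many SIMD instruction sets have a single instruction that does a saturating byte add across a whole register, so this is the sort of loop where hand-written or compiler-generated vector code can plausibly beat a byte-at-a-time version by a wide margin.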
The code I've worked on has been mostly codecs (audio, video, and images). Signal processing in general is amenable to assembly optimisations since you're typically dealing with relatively large chunks of work with few conditional branches (not counting loops). The longer an uninterrupted instruction sequence, the better your chances of doing something clever with it.
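As a rough illustration of "large chunks of work with few conditional branches", the sketch below computes a sum of absolute differences (SAD) between two pixel buffers, a building block commonly used in video motion estimation. It is not taken from any real codec, and the name `sad_u8` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum of |a[i] - b[i]| over n 8-bit samples.
 * The loop body has no data-dependent control flow beyond the
 * absolute value, which compilers typically lower to branch-free code. */
unsigned sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        sum += (unsigned)(d < 0 ? -d : d);
    }
    return sum;
}
```

The same work is done on every sample and the only branch is the loop itself, which is what leaves room for reordering, unrolling, and vectorising, whether by the compiler or by hand.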
These days, you're most likely to need assembly-level optimisations in real-time microcontroller applications, especially battery-powered ones, where you want to get as much as possible done before the deadline. There is less demand for extreme optimisation of regular desktop or phone applications. Too bad, it paid really well.