I don't know of any good reasons for any of the three.
In the past, some digital processing worked better, without artifacts or with fewer artifacts, when done at 96 kHz and then released at 44.1 kHz. That day has passed, I do believe.
Actually, there is often reason to upsample after capture at 48 or 44.1 kHz if you are going to do anything expressly nonlinear to the signal. An Nth-order polynomial nonlinearity multiplies the signal's bandwidth by N, so you must do any nonlinear processing at an oversampling rate above the highest power in the polynomial expansion of your nonlinearity if you are determined not to allow aliasing of your processing artifacts back down into the passband.
There's a section in Bob Katz's book about this, and somewhere I have a talk about the problems of nonlinearities. Clipping is a classic one here. Clipping of tone-like signals will introduce 3rd, 5th, 7th... harmonics, which for higher-frequency tones can alias right down into the midrange (or the bass if you're particularly unlucky). This is one of the reasons you just do not ever clip tonal signals digitally. Digital clipping is much harder than analog clipping, and does terrible things with the components that alias down.
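To make that concrete, here's a small numpy/scipy sketch (my own toy numbers, nothing from Katz's book): hard-clip a 15 kHz tone at 44.1 kHz and its 3rd harmonic (45 kHz) folds down to |45000 - 44100| = 900 Hz, while the same clip done 4x oversampled lets the decimation filter remove the harmonics instead of folding them back.

```python
import numpy as np
from scipy.signal import resample_poly

fs = 44100
t = np.arange(fs) / fs                      # one second of samples
x = 0.9 * np.sin(2 * np.pi * 15000 * t)     # a 15 kHz "tonal" signal

def peak_db_near(sig, rate, f_lo, f_hi):
    # crude windowed FFT: peak level in a band, dB relative to the carrier
    w = np.hanning(len(sig))
    spec = np.abs(np.fft.rfft(sig * w))
    f = np.fft.rfftfreq(len(sig), 1.0 / rate)
    band = (f >= f_lo) & (f <= f_hi)
    return 20 * np.log10(spec[band].max() / spec.max())

# 1) Hard clip at the base rate: the 3rd harmonic (45 kHz) is above
#    fs/2 = 22.05 kHz and aliases down to |45000 - 44100| = 900 Hz.
clipped = np.clip(x, -0.5, 0.5)

# 2) Same clip done 4x oversampled: 45 and 75 kHz now sit below the new
#    Nyquist (88.2 kHz), so resample_poly's decimation filter removes
#    them instead of letting them fold into the passband.
x4 = resample_poly(x, 4, 1)
clipped4 = resample_poly(np.clip(x4, -0.5, 0.5), 1, 4)

for name, sig in (("clipped at 1x", clipped), ("clipped at 4x", clipped4)):
    print(f"{name}: alias near 900 Hz = {peak_db_near(sig, fs, 800, 1000):.1f} dB")
```

Note that a hard clip is not a finite-order polynomial, so oversampling only suppresses the strongest aliases rather than eliminating them entirely, but 4x is already a dramatic improvement over clipping at the base rate.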
#2 is easy. Unlike digital responses, few analog devices, especially transducers, just die abruptly at some frequency. They roll off gradually or have resonances of various kinds. So a wide-bandwidth microphone that works to 40 kHz might maintain a much more linear, accurate response across the audio band, even for 20 kHz recordings.
What people call "analog" systems (i.e. a continuous-time, continuous-level analog of the original) most often have resonances and filter performance that work on an "octave" or "decade" frequency scale, i.e. 6 dB/octave per order for filters, and the like. This kind of response does not generally give you a fast rolloff at 20 kHz. Among other things, this is why it is common to use a digital (i.e. discrete-time, discrete-level analog) filter for the antialiasing filter, after sampling at a higher rate that allows for a simple analog antialiasing filter in front of the converter.
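A quick back-of-envelope in Python shows why (the 20 kHz corner and the two rates are my own illustrative choices):

```python
# How much does a simple first-order (6 dB/octave) analog lowpass at
# fc = 20 kHz attenuate the lowest frequency that can fold back onto
# 20 kHz after sampling?
import math

def first_order_db(f, fc=20000.0):
    # magnitude of H(f) = 1 / (1 + j f/fc), in dB
    return -10.0 * math.log10(1.0 + (f / fc) ** 2)

for fs in (44100.0, 176400.0):
    image = fs - 20000.0            # this frequency aliases onto 20 kHz
    print(f"fs = {fs/1000:g} kHz: image at {image/1000:g} kHz "
          f"-> {first_order_db(image):.1f} dB per pole")
```

One pole buys you roughly nothing at 44.1 kHz (about -4 dB) but a useful -18 dB or so at 4x; stack a few poles at the high rate and leave the steep final cutoff to a sharp digital filter before decimating.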
Digital filters (both IIR and FIR) as generally constructed have rolloff characteristics on a linear frequency scale, as opposed to the log frequency scale of "analog" systems.
Basically, the native frequency mapping of a digital filter (z domain) is linear frequency, and that of an analog second-order section (s domain) is log frequency. It is possible to use z-domain filters in the analog world (FIRs built on a chip using surface acoustic waves and the like), but they are much less common. Needless to say, one can, within the bandwidth, create a filter in the z domain that kind of mimics the analog performance, with the caveat that you can't go above fs/2.
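Here's a small sketch of that s-to-z mapping, using the standard bilinear transform (the 1 kHz corner and fs = 44.1 kHz are my arbitrary choices): well below fs/2 the two responses agree, and near fs/2 the digital one warps away, because the transform squeezes the entire analog frequency axis into 0..fs/2.

```python
import numpy as np
from scipy.signal import bilinear, freqs, freqz

fs = 44100.0
w0 = 2 * np.pi * 1000.0                 # 1 kHz corner frequency

b_s, a_s = [w0], [1.0, w0]              # analog prototype: H(s) = w0 / (s + w0)
b_z, a_z = bilinear(b_s, a_s, fs)       # z-domain equivalent

for f in (1000.0, 10000.0, 20000.0):
    _, h_a = freqs(b_s, a_s, worN=[2 * np.pi * f])   # analog response
    _, h_d = freqz(b_z, a_z, worN=[f], fs=fs)        # digital response
    print(f"{f/1000:5.1f} kHz: analog {20*np.log10(abs(h_a[0])):7.2f} dB,"
          f" digital {20*np.log10(abs(h_d[0])):7.2f} dB")
```

At 1 kHz both sit at -3 dB; by 20 kHz, close to Nyquist, the digital version is several dB steeper than its analog prototype, which is the "can't go above fs/2" caveat in action.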
#3 I can't think of a good reason for. Maybe a given 20 kHz microphone has a great color one wishes to use. Ribbons, for instance: you're not going to see too many ultrasonic-capable ribbon recording mikes. On some voices or some instruments they have a sound, one that is pleasing, and if you're using other microphones on other instruments, using those ribbons for a 96 kHz recording might make some kind of sense. That is, of course, if the 96 kHz recording made sense in the first place (which generally it may not).

The other case is for people who worry about the audibility of filters. I think that an overblown non-issue for the most part, exceeded only by the blather about jitter. But if you think that filtering at 44.1 kHz has audible consequences, you might record at 96 kHz to push those filters well beyond audibility, while knowing you only need a pretty good 20 kHz response from your microphones.
The choice of microphones depends both on the pattern (cardioid, etc.) the mic provides and on the evenness of the pattern it is built to have. Ribbons, for instance, often have tighter patterns at higher frequencies, and provide a "mellow" or "smooth" sound because they reject more high frequencies off-axis (they don't have to do this, but many do). Part of miking an instrument is understanding what sound you want. This is part of the creative process as well as part of the documentary process, and I am not going to write that book this night.