Surprised by the many replies; this really is an amazing community. I'll try to give you a clearer idea, since I've done some more research and I'm still not 100% sure whether my exact scope is possible or impossible (I may have oversimplified it in my post):
TWO important things that I may have described wrongly:
1) I don't want to know the exact time (for example minute/hour/day/month) that gunshot A happened; I just want to know its relative time.
So if it happens 01.33s into my audio file, then I would want my program/plugin to be able to say: gunshot A happened at 01.33s.
2) I do not have multiple audio sources/microphones recording this specific audio, so I cannot overlap them to get exact timing info by "clearing" the background noise.
This sounds like the equivalent of meteor detection. I'm not an expert, but I found this, which may put you on the right track:
https://www.meteornews.net/2021/03/30/automated-feature-extraction-from-radio-meteor-spectrograms/
Regards Andrew
^ I've just taken a look and, wow. The fact that this was possible gives me hope. I am stunned by the precision with which everything seems to be labeled, especially the events closer together. STUNNING. I wish I were smart enough to fully grasp this; I will be reading and researching, and definitely looking for an expert. This doesn't look like something I can do by myself.
A quick and easy way is to look at the signal in the time domain (waveform) and not in the frequency domain (spectrogram). You should be able to visually identify the gunshots in the waveform, I would guess. Any DAW software allows you to do that, and probably some other tools too …
100% understand this. This would be the way to go if the gunshots were further apart, but maybe I'm missing something, because...
when the gunshots are so near each other, I can't tell them apart when looking at the waveform,
because of the "fading": once a shot happens, it still lingers on, and by looking at the "fading" alone I can't tell whether it's simply the tail of the previous shot or a quieter gunshot.
But after looking into this, I think I'll explore this possibility more. The spectrogram just helped me visualize it better; I may attempt slowing it down.
The time resolution of the spectrogram is lower than that of the raw data or the waveform, and that's just the nature of the FFT* and the nature of the data... A couple of samples don't contain any frequency information. You'll see that if you zoom in far enough to see the individual samples in Audacity.
* FFT is the data used for the spectrum or spectrogram display.
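To make that trade-off concrete: with an FFT window of N samples, each spectrogram column spans N divided by the sample rate in time, and the frequency bins are the sample rate divided by N apart. A quick sketch (the 44.1 kHz sample rate is just an illustrative assumption; substitute your file's actual rate):

```python
sr = 44_100  # assumed sample rate, purely for illustration

# Each FFT window of N samples covers N/sr seconds of audio, so a
# bigger window smears apart events closer together than that span.
for window in (256, 1024, 4096):
    time_res_ms = window / sr * 1000  # time covered by one spectrogram column
    freq_res_hz = sr / window         # spacing between frequency bins
    print(f"window={window:5d}: {time_res_ms:6.1f} ms per column, "
          f"{freq_res_hz:6.1f} Hz per bin")
```

Note that a 4096-sample window at 44.1 kHz covers roughly 93 ms, so shots around 50 ms apart would land in the same column, which would explain why the count changes with the window size.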
I have realized the FFT may not be the perfect fit here. I heard "LAC" is a better fit for this kind of thing; I will need to research this more to see which one (FFT / LAC / simple amplitude) is the better fit.
Point is: it's possible to identify transients both in the frequency and time domains, and likewise possible to count the number of incidents. Pulling close incidents apart is pretty trivial and could likely be accurate to within tens of milliseconds. An issue that could arise would be environmental echo. A single shot with an echo could be reliably identified if there was more than 1 shot. However, it may be difficult to distinguish between an echo and a fully automatic weapon, *or* to pull apart echoes from *many* shots fired automatically in quick succession.
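To show the time-domain route is feasible, here is a minimal sketch of transient counting with a short-time energy envelope and a threshold. The frame length, threshold ratio, and the synthetic two-burst signal are all illustrative assumptions, not tuned values:

```python
import numpy as np

def detect_transients(signal, sr, frame_ms=5, threshold_ratio=0.25):
    """Mark a new onset each time the short-time energy envelope
    rises above a fraction of its maximum (illustrative parameters)."""
    frame = max(1, int(sr * frame_ms / 1000))
    n_frames = len(signal) // frame
    energy = np.array([np.sum(signal[i*frame:(i+1)*frame]**2)
                       for i in range(n_frames)])
    thresh = threshold_ratio * energy.max()
    onsets, above = [], False
    for i, e in enumerate(energy):
        if e > thresh and not above:
            onsets.append(i * frame / sr)  # onset time in seconds
            above = True
        elif e <= thresh:
            above = False
    return onsets

# Synthetic signal: two decaying 900 Hz bursts 60 ms apart, mimicking
# two close shots whose "fades" nearly run into each other.
sr = 8000
sig = np.zeros(sr)
burst = np.exp(-np.arange(400) / 80.0) * np.sin(2*np.pi*900*np.arange(400)/sr)
for start_s in (0.10, 0.16):
    idx = int(start_s * sr)
    sig[idx:idx + 400] += burst

onsets = detect_transients(sig, sr)
print(onsets)  # two onsets, near 0.10 s and 0.16 s
```

On real recordings the threshold and frame length would need tuning, and a fixed threshold is exactly what struggles with the quieter shots described below; but it illustrates that tens-of-milliseconds separation is resolvable in the time domain.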
Making the distinction between gunshots and the surprisingly large number of sounds that are similar to gunshots (but NOT) is a totally different issue. (Just ask car enthusiasts in Chicago.)
What is the purpose of this thought experiment?
One thing:
echo is 100% not a problem in this case, at least not to the point where it's interfering.
Second:
regarding being accurate to tens of milliseconds: that is EXACTLY what I'm after.
A clearer, real example:
I know for a FACT that in a specific section of the audio, lasting 0.80s in TOTAL in Audacity,
there have been *16* different gunshot events.
Basically, on average, one every 50 MILLISECONDS.
(Some had a longer pause; some were super, super close.)
If I look at the spectrogram, first of all, it is a mess.
Depending on how much I zoom, or on the "window size" (or bins? not sure, Audacity calls it "window size"), I can sometimes see
18 of them;
if I increase the size and zoom out just a little bit, 15 of them.
When I look at the amplitude (waveform), it is even harder for me to tell, because there are
8 big, giant peaks that are clear as day,
while the rest are much more of a bet.
The best I've managed to "see" is 17 in total, but I'm not reliable: each time I try, I see something different. Sometimes it seems like the fade of a previous one; sometimes it looks like a quieter one.
Sometimes two gunshots are so near each other that in the amplitude view they look like one.
Another question is: what about slowing it down? Would that help the software/plugin/human tell them apart better?
(Slowing it down, for example, allows me to consistently count 15 both in the amplitude view and in the spectrogram. So where is the 16th? I really have no reliable WAY of knowing HOW to label them, and I don't think anyone has clear instructions either, since this is so up in the air and depends on what you're analyzing.)
I would apply the techniques of image recognition using convolutional neural networks over the waveform and not the spectrogram, as others have suggested. Of course you would need some examples of gunshots to train the neural network.
In an image you have a grid of pixels, 2 dimensions, and a color attribute for each one, so three dimensions. Here you would have the time dimension and the value of the signal at each time, so one less dimension to deal with.
The first task is to decide whether the sound needs to be processed as it is, or whether it can be averaged with a rolling average of some milliseconds, just to reduce the amount of data. This is not critical, but it would save time training the model. Of course, one can proceed using the original time resolution.
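One simple variant of that averaging step is a block (non-overlapping) average, which both smooths the signal and shrinks the data; a minimal sketch, with an arbitrary example window length:

```python
import numpy as np

def block_average(x, sr, win_ms):
    """Average non-overlapping windows of win_ms milliseconds,
    reducing the sample count by the window length."""
    win = max(1, int(sr * win_ms / 1000))
    n = len(x) // win
    return x[:n * win].reshape(n, win).mean(axis=1)

x = np.arange(8, dtype=float)              # toy "waveform"
y = block_average(x, sr=4000, win_ms=0.5)  # 0.5 ms at 4 kHz -> 2-sample windows
print(y)  # [0.5 2.5 4.5 6.5]
```

The window length trades data reduction against the very time resolution this whole problem depends on, so it would need to stay well below the roughly 50 ms shot spacing described above.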
The first goal is to train the network as a classifier, to tell whether there is a gunshot in a given waveform, regardless of whether there is more than one. So the input of the network is a waveform and the output is binary: has gunshots or not.
Once you have this network trained, you can detect gunshots, but not yet tell when they occur. To accomplish that, you can divide the waveform into frames of a few seconds. The frame length should be something like the average duration of a gunshot, roughly speaking. The narrower the frame, the more "resolution" you'll have, but too narrow a frame won't contain enough information to "represent" a gunshot. The frames should be taken at each second, in the same fashion a convolutional network slides its filters, so consecutive frames overlap except for one second. So, for instance, if you have a 60-second waveform and the frames are 3 seconds, you would have 58 frames: (total seconds − frame seconds) / hop + 1, with a 1-second hop.
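The frame layout can be sketched as below; note that a 3-second frame hopped every 1 second over 60 seconds gives (60 − 3)/1 + 1 = 58 frames:

```python
def frame_spans(total_s, frame_s, hop_s=1.0):
    """Start/end times of overlapping analysis frames,
    stepped every hop_s seconds."""
    spans = []
    start = 0.0
    while start + frame_s <= total_s:
        spans.append((start, start + frame_s))
        start += hop_s
    return spans

spans = frame_spans(60, 3)
print(len(spans), spans[:2])  # 58 frames: (0, 3), (1, 4), ...
```

For the tens-of-milliseconds precision discussed earlier, the same scheme would just use a much smaller frame and hop than the seconds-scale numbers in this example.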
Then, as with images, if you apply the trained network to each frame, you should be able to tell the moments at which gunshots occur. The network will detect a gunshot in more than one frame, so you can also get a notion of its duration. Again, this must be tuned to be precise, trying to find the minimum frame span that works.
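Assuming you already have a per-frame yes/no prediction from the classifier (the flags below are made up for illustration), the overlapping positives can be merged into event spans like this:

```python
def frames_to_events(flags, hop_s=1.0, frame_s=3.0):
    """Merge runs of consecutive positive frames into (start, end)
    time spans; flags is the classifier's per-frame binary output."""
    events, i = [], 0
    while i < len(flags):
        if flags[i]:
            j = i
            while j + 1 < len(flags) and flags[j + 1]:
                j += 1
            events.append((i * hop_s, j * hop_s + frame_s))
            i = j + 1
        else:
            i += 1
    return events

# hypothetical classifier output for 7 frames (1 s hop, 3 s frames)
print(frames_to_events([0, 1, 1, 0, 0, 1, 0]))  # [(1.0, 5.0), (5.0, 8.0)]
```

Two shots closer together than one frame length will merge into a single span here, which is exactly the "find the minimum frame span that works" tuning mentioned above.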
Overall, I'm pretty sure it's doable, although not without some effort.
So many details, and it touches on one idea that I had (neural networks, training for a specific sound), etc.
Thank you so much for this reply, so detailed too. You are amazing.
////
Sorry for the giant reply. I'm not good with words, and this problem is much more complicated than it seems (both knowingly and unknowingly), but I think it's going to be worth the hassle. I'm sure I'll be spending quite a lot of time on this forum.