
Any SPECTROGRAM experts? Have a question / wondering if my idea is possible

SpectrogramFan

New Member
Joined
Apr 7, 2022
Messages
4
Likes
1
So when you have a .wav file and open it in Audacity (or any other software with a spectrogram view),
it instantly converts the sound into this beautiful 2D graph.

Now let's say the background audio is mostly silent,
and the only sound heard from time to time is gunshots. Every time a gunshot is heard, it is CLEARLY visible in the spectrogram:
it fills up lots of "frequency ranges" and then drops back off into silence.

My question is:
Would there be a way to code something that could tell WHEN each of those sounds starts?

So, for example: silence - GUNSHOT - the millisecond the spectrogram shows a gunshot-like pattern, then BOOM, the program recognizes it and says: at this exact millisecond, a gunshot was heard.

So the goal of the first question is:
would it be possible to create a program that could COUNT the number of gunshots heard in an audio file?
(I realize this may be possible without a spectrogram, but I think it's a much better way to visualize it, and it might be easier.)

///
Second question:

If a gunshot is heard, the sound "lingers on" for a moment before fading into silence.
If there were a SECOND gunshot during that "drop-off" period, would it be possible to teach the program to recognize it?
What if both sounds were super close together? 300ms apart? 200ms apart? 100ms apart?


What are the limitations one would face when trying to code something like this, if it's even possible?
 

HarmonicTHD

Major Contributor
Joined
Mar 18, 2022
Messages
3,326
Likes
4,835
SpectrogramFan said:
Would there be a way to code something that could tell WHEN each of those sounds starts? … would it be possible to create a program that could COUNT the number of gunshots heard in an audio file? … What are the limitations one would face when trying to code something like this, if it's even possible?
A quick and easy way is to look at the signal in the time domain (the waveform) rather than the frequency domain (the spectrogram). You should be able to visually identify the gunshots in the waveform, I would guess. Any DAW software lets you do that, and probably some others too …
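For illustration, a minimal Python sketch of this time-domain approach (the file name, threshold, and spacing are my own assumptions, not the poster's): rectify the waveform and pick peaks that stand well above the noise floor.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import find_peaks

rate, data = wavfile.read("recording.wav")   # hypothetical file
if data.ndim > 1:
    data = data.mean(axis=1)                 # mix down to mono
env = np.abs(data.astype(float))             # rectified amplitude envelope

# Crude heuristic: peaks well above the median background level,
# at least 100 ms apart. Both numbers would need tuning per recording.
peaks, _ = find_peaks(env, height=10 * np.median(env),
                      distance=int(0.1 * rate))
for p in peaks:
    print(f"possible shot at {p / rate:.3f} s")
```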
 

dc655321

Major Contributor
Joined
Mar 4, 2018
Messages
1,597
Likes
2,235
In the simple case, cross-correlating the original track with a sample of a gunshot would produce a third signal with peaks at the samples where the gunshots occur, hence giving the times.
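A minimal sketch of that idea, assuming a clean single-shot template is available (file names are hypothetical, and mono audio is assumed):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate, find_peaks

rate, track = wavfile.read("recording.wav")          # long recording
_, template = wavfile.read("gunshot_template.wav")   # one clean shot
track, template = track.astype(float), template.astype(float)

# In 'valid' mode, output index k is the template aligned at sample k.
xc = correlate(track, template, mode="valid")
xc /= np.max(np.abs(xc))                             # scale to +/-1

peaks, _ = find_peaks(xc, height=0.5, distance=int(0.05 * rate))
print([f"{p / rate:.3f} s" for p in peaks])
```

In practice, plain correlation is sensitive to level and to how repeatable the shots are; a normalized cross-correlation or an envelope-based variant is usually more robust.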
 

DVDdoug

Major Contributor
Joined
May 27, 2021
Messages
3,024
Likes
3,980
If you're recording, of course the recording includes the (relative) time (but not the time of day). The time resolution of the spectrogram is lower than that of the raw data or the waveform; that's just the nature of the FFT* and of the data itself... A couple of samples don't contain any frequency information. You'll see that if you zoom in far enough to see the individual samples in Audacity.

People doing scientific work/research often use MATLAB (or one of the MATLAB clones) to run FFTs and analyze audio. I've never used MATLAB, but from what I understand it can read WAV files. You'd still probably have to create an algorithm (depending on what's built in), so it's not that much different from writing an application.


* FFT is the data used for the spectrum or spectrogram display.
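The trade-off is easy to see numerically. With SciPy's spectrogram (a stand-in for what Audacity or MATLAB do internally), the time step grows and the frequency step shrinks as the window gets longer; the synthetic signal and numbers below are only illustrative.

```python
import numpy as np
from scipy.signal import spectrogram

rate = 48000
t = np.arange(rate) / rate
# One second of sparse noise bursts, standing in for gunshots.
x = np.random.randn(rate) * (np.sin(2 * np.pi * 4 * t) > 0.99)

for nperseg in (256, 1024, 4096):
    f, times, S = spectrogram(x, fs=rate, nperseg=nperseg, noverlap=0)
    print(f"window {nperseg:5d} samples -> "
          f"time step {1000 * (times[1] - times[0]):6.2f} ms, "
          f"freq step {f[1] - f[0]:7.1f} Hz")
```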
 

paddycrow

Senior Member
Forum Donor
Joined
Oct 28, 2019
Messages
342
Likes
576
Location
Grand Haven, MI
There is a learning curve with MATLAB, but it is very useful. As I recall, it's also not cheap, so my former employer didn't allow everyone to have it on their laptop.
 
Joined
Mar 20, 2022
Messages
17
Likes
19
Location
USA
Former DSP developer, here...

Yes, it's possible. I did something similar about a decade ago to synchronize different mic sources that were spread over several dozen yards and had different recording start times. The initial rough sync correlates FFT signatures; a time-domain sync then refines it with greater precision.

Point is: it's possible to identify transients both in the frequency and time domains, and likewise possible to count the number of incidents. Pulling close incidents apart is pretty trivial and could likely be accurate to within tens of milliseconds. An issue that could arise is environmental echo. A single shot with an echo could be reliably identified if there was more than one shot. However, it may be difficult to distinguish between an echo and a fully automatic weapon, *or* to pull apart the echoes of *many* shots fired in quick succession.
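The poster doesn't give code, but one common generic way to flag transients in the frequency domain is spectral flux: sum the positive magnitude increase between consecutive STFT frames and pick the peaks. A rough sketch, with window length and threshold as arbitrary choices of mine:

```python
import numpy as np
from scipy.signal import stft, find_peaks

def onset_times(x, rate, nperseg=512):
    # Short-time Fourier transform: rows = frequency bins, cols = frames.
    f, t, Z = stft(x, fs=rate, nperseg=nperseg)
    mag = np.abs(Z)
    # Spectral flux: positive frame-to-frame increase, summed over bins.
    flux = np.sum(np.maximum(mag[:, 1:] - mag[:, :-1], 0.0), axis=0)
    peaks, _ = find_peaks(flux, height=flux.mean() + 3 * flux.std())
    return t[1:][peaks]   # flux[i] compares frames i and i+1
```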

Making the distinction between gunshots and the surprisingly large number of sounds that are similar to gunshots (but NOT) is a totally different issue. (Just ask car enthusiasts in Chicago.)

What is the purpose of this thought experiment?
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,501
Likes
1,980
Location
La Garriga, Barcelona
I would apply the techniques of image recognition using convolutional neural networks, over the waveform rather than the spectrogram, as others have suggested. Of course you would need some examples of gunshots to train the neural network.

In an image you have a grid of pixels (2 dimensions) plus a color attribute for each one, so three dimensions. Here you have the time dimension and the value of the signal at each time, so one dimension less to deal with.

The first task is to decide whether the sound needs to be processed as is, or whether it can be averaged with a rolling average of a few ms, just to reduce the amount of data. This is not critical, but it would save training time. Of course one can proceed with the original time resolution.

The first goal is to train the network as a classifier that tells whether there is a gunshot in a given waveform, regardless of how many there are. So the input of the network is a waveform and the output is binary: has gunshots or not.

Once you have this network trained, you can detect gunshots, but not yet tell when they occur. To accomplish that, you can divide the waveform into frames of a few seconds. The frame length should be something like the average duration of a gunshot, roughly speaking. The narrower the frame, the more "resolution" you'll have, but too narrow a frame won't contain enough information to "represent" a gunshot. The frames should be taken at every second, in the same fashion a convolutional network slides its kernels, so consecutive frames overlap in all but one second. So, for instance, if you have a 60-second waveform and 3-second frames, you get 58 frames: (overall seconds - frame seconds) + 1.

Then, as with images, if you apply the trained network to each frame, you should be able to tell the moments at which gunshots occur. The network will detect a gunshot in more than one frame, so you also get a notion of its duration. Again, this must be tuned for precision, trying to find the minimum frame span that works.

Overall, I'm pretty sure it's doable, although not without some effort.
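As a concrete (and deliberately tiny) sketch of the above, here is what such a classifier plus sliding-window detector could look like in PyTorch. Every layer size, frame length, and threshold is an arbitrary placeholder of mine, and a real version would still need a labeled training set and a training loop, which are the hard parts here.

```python
import torch
import torch.nn as nn

class ShotClassifier(nn.Module):
    """1-D CNN over raw waveform frames; outputs a single logit."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, 1),          # > 0 means "gunshot present"
        )

    def forward(self, x):              # x: (batch, 1, samples)
        return self.net(x)

def detect(model, wave, rate, frame_s=1.0, hop_s=0.1):
    """Run the classifier over overlapping frames; return hit times (s)."""
    frame, hop = int(frame_s * rate), int(hop_s * rate)
    hits = []
    for start in range(0, len(wave) - frame, hop):
        x = torch.tensor(wave[start:start + frame], dtype=torch.float32)
        score = torch.sigmoid(model(x.view(1, 1, -1))).item()
        if score > 0.5:
            hits.append(start / rate)
    return hits
```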
 
OP
S

SpectrogramFan

New Member
Joined
Apr 7, 2022
Messages
4
Likes
1
Surprised by the many replies, really an amazing community. I'll try to give you a clearer idea of this, since I've done some more research and I'm still not 100% sure whether my exact scope is possible or impossible (I may have oversimplified it in my post):


TWO important things that I may have described wrongly:
1) I don't want to know the exact time (example: minute/hour/day/month) that gunshot A happened; I just want to know its relative time.
So if it happens 01.33s into my audio file, then I would want my program/plugin to be able to say: gunshot A happened at 01.33s.

2) I do not have multiple audio sources/microphones recording this specific audio, so I cannot overlap them to get exact timing info by "clearing" the background noise.


This sounds like the equivalent of meteor detection. Not an expert, but I found this, which may put you on the right track: https://www.meteornews.net/2021/03/30/automated-feature-extraction-from-radio-meteor-spectrograms/

Regards, Andrew
^ I've just taken a look and, wow. The fact that this was possible gives me hope. I am stunned by what seems to be the precision with which everything was labeled, especially the ones closer together. STUNNING. I wish I were smart enough to fully grasp this; I will be reading and researching, and definitely looking for an expert. This doesn't look like something I can do by myself.

HarmonicTHD said: A quick and easy way is to look at the signal in the time domain (the waveform) rather than the frequency domain (the spectrogram) …
100% understand this. This would be the way to go if the gunshots were further apart, but maybe I'm missing something, because...
when the gunshots are so near each other, I can't tell them apart when looking at the waveform,
because of the "fading": once a shot happens, it still lingers on, and looking at the tail I can't tell if it's simple fading or a quieter gunshot.
But after looking into this, I think I'll explore this possibility more. The spectrogram just helped me visualize it better; I may attempt slowing it down.

DVDdoug said: The time resolution of the spectrogram is lower than that of the raw data or the waveform … A couple of samples don't contain any frequency information.
I have realized the FFT may not be the perfect fit here; I heard "LAC" is a better fit for this kind of thing. I will need to research this more to see which one (FFT / LAC / simple amplitude) is the better fit.


The DSP developer above said: it's possible to identify transients both in the frequency and time domains … Pulling close incidents apart is pretty trivial and could likely be accurate to within tens of milliseconds … What is the purpose of this thought experiment?
One thing:
echo is 100% not a problem in this case, at least not to the point where it interferes.

Second:
regarding being accurate to tens of milliseconds - that is EXACTLY what I'm after.



A clearer, real example:
I know for a FACT that in a specific section of the audio, lasting 0.80s in TOTAL in Audacity,
there have been *16* distinct gunshot events.
Basically, on average, one every 50 MILLISECONDS.
(Some had a longer pause; some were super, super close.)

If I look at the spectrogram, first of all, it is a mess.
Depending on how much I zoom, or on the "window size" (or bins? not sure; Audacity calls it "window size"), I can sometimes see
18 of them;
if I increase the size and zoom out just a little bit, 15 of them.

When I look at the amplitude, it is even harder for me to tell, because there are
8 big, giant peaks that are clear as day,
while the rest are much more of a bet.

The best I've managed to "see" is 17 in total, but I'm not reliable: each time I try, I see something different. Sometimes it seems like the fading of a previous one; sometimes it looks like a quieter one.

Sometimes two gunshots are so near each other that in the amplitude window they look like one.


Another question is: what about slowing down? Would it help the software/plugin/human tell them apart better?
(Slowing it down, for example, lets me consistently count 15 both in the amplitude view and in the spectrogram; so where is the 16th? I really have no reliable WAY of knowing HOW to label them, and I don't think anyone has clear instructions either, since this is so up in the air and depends on what you're analyzing.)



xaviescacs said: I would apply the techniques of image recognition using convolutional neural networks over the waveform … Overall, I'm pretty sure it's doable, although not without some effort.
So many details, and it touches on one idea that I had (neural networks, training for a specific sound), etc.
Thank you so much for this reply, so detailed too; you are amazing.





////
Sorry for the giant reply. I'm not good with words, and this problem is much more complicated than it seems (both knowingly and unknowingly), but I think it's going to be worth the hassle. I'm sure I'll be spending quite a lot of time on this forum.
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,501
Likes
1,980
Location
La Garriga, Barcelona
In any case, you have to analyze your problem from a mathematical point of view, as an abstraction, and see which model is most similar to yours. Detecting a meteor is a quite different problem (just look at the raw data), whereas image recognition is quite similar. For a neural network, an image or a waveform is essentially the same, so if it works with images, it will work with audio.

Where will it struggle? As with images: when there is something very similar to a gunshot that is not one, or when a gunshot with a different, unknown timbre appears, etc. This is where the training set, the stock of gunshots used to train the model, becomes crucial, and the difference between the project's success and failure.
 

KSTR

Major Contributor
Joined
Sep 6, 2018
Messages
2,762
Likes
6,174
Location
Berlin, Germany
SpectrogramFan said: (I realize this may be possible without needing a spectrogram but I think it's a much better way to visualize it, and might be easier)
Looking at the signal in the time domain (the waveform) is usually enough to identify shots or any similar wideband pulse signals.

About a second of a large outdoor fireworks recording, courtesy of Tom Danley, waveform:
[attached image: waveform plot]


With a bit of tinkering with the parameters of the spectrogram function of REW, we can get this:
[attached image: spectrogram plot]

This plot is normalized to the loudest part of each frequency band within the snippet (the third pulse from the left), which makes the ridges easier to identify. At 9 positions along the time axis there appear to be distinct "shot" events. Time resolution seems good even at the smallest repetition interval of about 30ms.

OTOH, in the waveform we can just as easily identify the nine peak events, and for a software approach (pattern search, basically) I'd start from there first. Only when the shots are very low in level compared to the ambient noise might the spectral view (and data representation) offer more detail to "decode" events. Some pre-filtering may help in either case to remove useless signal parts that clutter the image.
Having all of this fully automated in software would be a major effort, I think, but it is basically doable.
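For what it's worth, "normalized to the loudest part of each frequency band" boils down to dividing each row of the magnitude spectrogram by its own maximum. This is my reading of the description, not REW's actual code:

```python
import numpy as np

def normalize_per_band(S):
    """S: magnitude spectrogram, shape (freq_bins, time_frames).
    Scale each frequency band so its loudest frame becomes 1 (0 dB),
    which makes ridges visible even in quiet bands."""
    return S / np.maximum(S.max(axis=1, keepdims=True), 1e-12)
```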

------------------

@SpectrogramFan can you share your file(s)? Dropbox or Google Drive, etc?
May help with other suggestions.
^ This!

another question is: what about slowing down? would it help make the software/plugin/human be able to tell it apart better?
Humans, yes. In Adobe Audition you can simply set the playback sample rate of a file to a much lower rate and thus listen to it at a fraction of the original speed. This is very effective at telling events apart by listening.

For software, changing the speed has no benefit.
However, since gunshots will probably contain a lot of high-frequency pulse energy even beyond 20kHz, the higher your sample rate, the higher the time resolution you can achieve (and use a good wide-band mic).
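That sample-rate trick isn't specific to Audition. As a sketch, the soundfile package can write the untouched samples back with a lower declared rate, so the file plays slower and lower-pitched (file names are hypothetical):

```python
import soundfile as sf

data, rate = sf.read("shots.wav")
# Same samples, lower declared rate: plays 8x slower, 3 octaves down.
sf.write("shots_8x_slow.wav", data, rate // 8)
```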
 
OP
S

SpectrogramFan

New Member
Joined
Apr 7, 2022
Messages
4
Likes
1
KSTR said: Looking at the signal in the time domain (the waveform) is usually enough to identify shots … This plot is normalized to the loudest part of each frequency band within the snippet … For software, changing the speed has no benefit. However, since gunshots will probably contain a lot of high-frequency pulse energy even beyond 20kHz, the higher your sample rate, the higher the time resolution you can achieve (and use a good wide-band mic).
Could you explain what is meant by "This plot is normalized to the loudest part of each frequency band within the snippet"?
I'm missing that piece of info to fully understand what you may be getting at.
(Plus: do the dotted black lines on the loudest pulse have anything to do with it? Did you place them yourself? What exactly does "normalizing" mean?)


Regarding sharing an audio file to test: I will see if I'm able to do that. I guess it'd be much clearer to do this with an example; I'll update you on this.

Regarding waveform VS spectrogram:
when I look at the spectrogram, each firework sound is absolutely clear as day.
When I look at the waveform (after having seen the spectrogram), I can see the nine "events" just as well, but... if I didn't know this beforehand, what would be the most reliable way for me to count them there?

For example, in the 283ms section of the spectrogram (where your cursor is pointing), I can clearly see that a firework-related sound was there.

Now look at the waveform between the two big firework-related sounds (the ones before and after 283ms):
how would I know that there's another sound there?
If I didn't know it beforehand, I'd just assume there is no sound there; the previous one was quite loud, so this would look
like its "fading".
And if I *knew* that there was an event (firework/gunshot) in between, then I would count 2 or 3 peaks instead of just one.




Regarding slowing down:
back to the 16-gunshots-in-0.8s example I mentioned before:
I slowed it down repeatedly, 50%, then 50% again, and so on, until I had lengthened the original audio by *60 times*,
and then I was able, clear as day, to count each one.

Now the problem is that I don't really know which way is best to SLOW DOWN audio while preserving its quality (or the telltale signs for me to pick up on).
There's "Change Speed", for example, which I assume Adobe Audition would be doing too (I hope better than Audacity, though).
When I slow the speed down by 50%, in the spectrogram view I can see the "max frequency" halve each time (obviously; this makes sense).
The end result after slowing it down by 60 times is that the sounds are VERY deep, but I can tell them apart.

There's "Change Tempo" though, which says it won't change the pitch,
and if I slow it down by 60 times it sounds so synthesized and so weird that... maybe I'm doing it wrong, because in my mind I would think it's the better option.


And regarding the 20kHz point:
that is indeed a very good point, and was my first idea. Compared to ANY other sound, no matter what might have been in the background,
many times I could easily tell the shots in the spectrogram simply by looking at the high-frequency peaks.
The problem is that it doesn't seem to catch them all, which makes this unreliable for now.


(I really hope you'll be able to explain the REW bit; I feel like it'll be another step forward!)
 

dc655321

Major Contributor
Joined
Mar 4, 2018
Messages
1,597
Likes
2,235
what does "normalizing" exactly mean?)

Typically, it means to scale the data, usually relative to some feature in the data itself, making relative comparisons easier.

e.g. normalizing a signal to have a max value of unity: this can be done by dividing each value by the largest of all values in the dataset.
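In NumPy terms, that example is essentially a one-liner:

```python
import numpy as np

def normalize(x):
    # Scale so the largest absolute value becomes exactly 1.
    return x / np.max(np.abs(x))
```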
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,501
Likes
1,980
Location
La Garriga, Barcelona
Speech recognition is perhaps the best path to investigate.

In this article, for instance, a recurrent neural network is used instead of a convolutional one, as I suggested before. He averages the soundwave and then calculates the spectrogram.

He is more or less following this paper.

That example is aimed specifically at sounds that have the same timbre or spectrum but that can be very short or very long. You may find that this is or isn't the case with gunshots, so perhaps a different approach would be necessary.

In any case, if you look at how machine learning deals with speech recognition, you'll find plenty of examples with the code included, which is way better than trying to implement the improvised suggestion I wrote before. :)
 

Spkrdctr

Major Contributor
Joined
Apr 22, 2021
Messages
2,223
Likes
2,944
Also, you might be wandering into the military use case of telling where shots are coming from. The systems they use hear the sound and can then point you in the exact direction of the shooter. I'm not sure without looking into it in depth, but I think they can even tell the distance of the shot too. So a lot has been done and is in use regarding gunshot detection with digital systems, microphones, etc. Pretty amazing stuff.
 

kschmit2

Active Member
Joined
Oct 8, 2018
Messages
167
Likes
270
Gunshot analysis shouldn't be much different from at least the initial step in analyzing seismic data, i.e. distinguishing between no event and the initial pulse of an earthquake. The primary indicator is the rise time of the event.
In the example above provided by @KSTR you can easily see the sharp rises (i.e. short rise times) at the onset of each event (the explosion of the fireworks).

A very detailed explanation of seismic data analysis can be found here: https://gfzpublic.gfz-potsdam.de/rest/items/item_4009_4/component/file_4010/content?download=true
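The standard first-pass detector in seismology for exactly this kind of rise-time picking is the STA/LTA trigger: the ratio of a short-term average to a long-term average of signal energy, which spikes at sharp onsets. A rough, simplified sketch with arbitrary window lengths:

```python
import numpy as np

def sta_lta(x, rate, sta_s=0.01, lta_s=0.5):
    """Short-term / long-term average energy ratio (simplified)."""
    energy = np.asarray(x, dtype=float) ** 2
    sta_n, lta_n = int(sta_s * rate), int(lta_s * rate)
    sta = np.convolve(energy, np.ones(sta_n) / sta_n, mode="same")
    lta = np.convolve(energy, np.ones(lta_n) / lta_n, mode="same")
    return sta / np.maximum(lta, 1e-12)   # spikes at sharp onsets

# e.g. flag samples where sta_lta(x, rate) exceeds ~5
```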
 

kschmit2

Active Member
Joined
Oct 8, 2018
Messages
167
Likes
270
And to give a perspective on what has been investigated in the field of gunshot detection, have a look at the 2019 paper "Development of Computational Methods for the Audio Analysis of Gunshots" by Ryan Lilien.

The goals there were:
  1. Detect gunshots in an audio recording
  2. Compute shot-to-shot timings
  3. Determine the number of firearms present in a recording and assign shots to firearms
  4. Construct a predictive model of the likely class, caliber, and make/model of recorded gunshots
No. 2 is what the OP is looking for.

Link:
 

xaviescacs

Major Contributor
Forum Donor
Joined
Mar 23, 2021
Messages
1,501
Likes
1,980
Location
La Garriga, Barcelona
Nice paper. They don't give many details of each implementation, but very nice indeed. Notice that the neural network's performance is better than that of the LDA/PCA one, as expected.

In general, for such a task, if you don't need an analytical model, which in this case wouldn't make much sense anyway, a neural network is the best choice.
 