
help: linux | compare wav file | confidence level

fatoldgit

So I have written (in a ksh script) a voice control interface for my music playback GUI and it works well except for one situation.

The issue is that all the English models (across every voice-recognition tool I have tested) are based on US English.

But I am a Kiwi (NZ'er), and every one of these Linux tools, while fine for keywords like "up", "down", "left", "right", "next", "previous" etc., does a crap job with single-letter utterances (A, B, C, D etc.).

They can't tell my "A" from my "I", or my "B" from my "E". I even tried non-English models (I don't care what text they spit out as long as it's unique).

Some models appear to support training, but aside from the complexity (it involves lots of discrete steps), you have to install many GBs (sometimes > 10 GB) of "stuff" to do it.

Another primary concern is that whatever I use must still be installable in 10, 20, 30 years' time and must be "offline" (i.e. must not need a cloud resource).

So the requirement is simple: I have, say, 50 utterances that need to be recognized, and I can easily set up a loop that listens to and records from a mic and trims out the pre- and post-silence.

What I haven't been able to find is a simple, durable set of Linux CLI commands that can compare a captured utterance against a master WAV file (the loop itself is simple: grab the sound from the mic, then run through my master set of WAV utterances comparing each one).
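
To make the shape of it concrete, the listening side I have in mind is roughly the sketch below (assuming arecord for the capture and sox for the silence trimming; the device name, paths and thresholds are just placeholders). The compare step inside the inner loop is the bit I'm missing:

#!/bin/ksh
# Sketch of the capture loop: record from the mic, trim silence, then compare against each master WAV
MASTERS=$HOME/voice/masters                 # directory holding the ~50 master utterances (placeholder)

while true
do
    # grab 2 seconds of 16-bit 44.1 kHz mono from the mic (device name is a placeholder)
    arecord -D hw:0,0 --duration=2 -f cd -c 1 /tmp/capture.wav

    # trim leading silence, then reverse/trim/reverse to strip the trailing silence
    sox /tmp/capture.wav /tmp/trimmed.wav silence 1 0.1 1% reverse silence 1 0.1 1% reverse

    for master in "$MASTERS"/*.wav
    do
        : # missing piece: compare /tmp/trimmed.wav against "$master" and print a confidence value
    done
done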

No issue having to compile from source... in fact I prefer that, as it means I can support it well into the future (my entire playback stack is compiled from C/C++ source code).

So what I need is something that can compare two WAV files and print (to stdout) a "confidence" level that they are or aren't the same utterance... whether that is a direct comparison or requires producing a secondary file (a "fingerprint") that is then compared, I don't care.

Any help greatly appreciated.


Peter

PS. You can probably tell from my use of ksh, C and C++ that I am an old skool Unix (latterly Linux) dev with some 45+ years' experience, so these are my "go to" tools for back-end software development. popen() is very robust for integrating C/C++ with a ksh script when you need to interrogate return values, and with ksh you have the whole world of the Linux command line at your disposal (find, awk, sed, cat, sort, grep, xdotool etc.).
 
Hmm, I don't have a good suggestion for you but I think comparing audio waveforms directly is not it.

Two recordings of you saying 'A' will be extremely different on a sample-by-sample basis.

And it sounds like standard speech to text is not great either due to it being overly abstracted.

For this specific usage I think you need a speech processing library that detects sub-phoneme features like formants, plosives, etc. Then you can detect which letter is being said yourself.

I don't know if such a library exists, sorry! Just trying to get my head around the issue.
 
Thanks for that.

Of course, the issue is that I have no grounding in the language of speech recognition, so you have at least given me some keywords to search for (sub-phoneme features like formants, plosives, etc.).

So off I head to Google armed with some new words (which I need to understand first!)

Thanks again

Peter
 
So off I head to Google armed with some new words (which I need to understand first!)
Formants are basically the frequency peaks we use to identify vowels (at least that's how I understand it); plosives are the "puff of air" sounds that happen when we say P or B words; you also have "fricatives", which are the "sh" or "f" sounds... My thinking is basically that if speech detection is assuming too much, you can go upstream a step and detect sounds instead of letters. That is a few steps above raw audio, so at least you aren't left trying to use something like DeltaWave.

Glad to provide a new idea here, but I'm afraid that's all I've got to offer! Hopefully it leads to something useful.
 
Ok

So I buggered around for about 3 hours with a vowel-analysis tool called "Praat" (very well known among academics in this space) and gave up (at least on that). It's a great tool with its own powerful scripting language, but it didn't work for my use case.

I went back to searching for Linux speech-to-text software, and after about 2 hours I found (and tested) what I needed.

Note I avoid anything in Python, simply because Python tools are a pain to keep running from Linux release to release... they typically need a complete reinstall via "pip", which means if some git repository dies you are dead in the water (plus they need tons of dependencies and dump stuff everywhere).

My new love is "whisper.cpp", which is a command-line C++ interface to a downloadable model.

So it fits the bill in terms of longevity: the downloaded model(s) sit under the same directory as the C++ source code, so I just need to recompile (given the model already exists locally) when upgrading Linux... totally standalone/portable.

It also doesn't reach out to anything on the web at runtime (tested with my Ethernet connection disabled).

I tested various model sizes and actually found the "tiny" model best. The really large models tried "too hard" and got my 'A' wrong... they always returned 'I'. The tiny model returns "bye" for my "A", which is fine (I only need the output to be unique). All other letters worked, and so did keywords like "home", "top", "stop", "pause", "back" etc.

The other cool thing is that the project is still active, with recent updates (many others I looked at over the last year were dead).

I have tried many Linux STT programs, and this is the easiest to install and very accurate (at least for what I need).

The install is simple (noting that my Linux image has tons of development libraries already installed, so I didn't hit any dependency issues... someone else might):

git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
sh ./models/download-ggml-model.sh tiny.en
cmake -B build
cmake --build build --config Release
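
If you want to sanity-check the build before pointing it at a mic, the repo ships some sample clips (assuming that hasn't changed), e.g.:

# quick smoke test of the build and the tiny English model against a bundled sample
./build/bin/whisper-cli -m models/ggml-tiny.en.bin -f samples/jfk.wav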

Then for simple testing I used:

arecord -D hw:CARD=ATR4650USB,DEV=0 --duration=2 -f cd -c 1 test.wav
./build/bin/whisper-cli -f test.wav -m models/ggml-tiny.en.bin -np -t 4

Strangely (as far as I can see) it doesn't support "piped" input... I will probably hack it to support this once it's bedded in. I will also see if it can read from a named pipe in its current state... which would be just as good.

Results for "A", "B" and "C" are below.

============== A =====================
Recording WAVE 'test.wav' : Signed 16 bit Little Endian, Rate 44100 Hz, Mono
[00:00:00.000 --> 00:00:02.000] Bye.
============== B ====================
Recording WAVE 'test.wav' : Signed 16 bit Little Endian, Rate 44100 Hz, Mono
[00:00:00.000 --> 00:00:02.000] Be.
============== C ====================
Recording WAVE 'test.wav' : Signed 16 bit Little Endian, Rate 44100 Hz, Mono
[00:00:00.000 --> 00:00:02.000] See.

It's a drop-in replacement for the STT app I am using now (which I hacked to output the recognized text to a file), which is then read by the ksh script that drives the GUI. I just need to mod the loop that acquires the text to use the above commands as its basis, and all is good.
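
For anyone wanting to do the same thing, the shape of that modified loop will be something like the sketch below. It assumes the -np and -nt options between them leave just the recognized text on stdout; the case arms (including the "bye" = 'A' quirk noted above) are only illustrative, and the real script will drive the GUI rather than echo:

#!/bin/ksh
# Sketch of the acquisition loop: record, transcribe with whisper-cli, dispatch on the text
WHISPER=./build/bin/whisper-cli
MODEL=models/ggml-tiny.en.bin

while true
do
    # 2 seconds from the mic, 16-bit 44.1 kHz mono (device name as per my setup)
    arecord -D hw:CARD=ATR4650USB,DEV=0 --duration=2 -f cd -c 1 /tmp/utt.wav 2>/dev/null

    # -np: print nothing but results, -nt: drop timestamps; normalise to bare lowercase words
    text=$($WHISPER -f /tmp/utt.wav -m $MODEL -np -nt -t 4 | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | sed 's/^ *//; s/ *$//')

    case "$text" in
        bye)      echo "A" ;;                      # tiny model hears my "A" as "bye"
        be|b)     echo "B" ;;
        see|c)    echo "C" ;;
        home|top|stop|pause|back)
                  echo "$text" ;;
        *)        echo "unrecognised: $text" >&2 ;;
    esac
done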

Thanks for your help... while it didn't ultimately help me directly, it did force me to keep looking for an STT solution.

Peter

PS. The tool has lots of options that I haven't played with much:

options:
-h, --help [default] show this help message and exit
-t N, --threads N [4 ] number of threads to use during computation
-p N, --processors N [1 ] number of processors to use during computation
-ot N, --offset-t N [0 ] time offset in milliseconds
-on N, --offset-n N [0 ] segment index offset
-d N, --duration N [0 ] duration of audio to process in milliseconds
-mc N, --max-context N [-1 ] maximum number of text context tokens to store
-ml N, --max-len N [0 ] maximum segment length in characters
-sow, --split-on-word [false ] split on word rather than on token
-bo N, --best-of N [5 ] number of best candidates to keep
-bs N, --beam-size N [5 ] beam size for beam search
-ac N, --audio-ctx N [0 ] audio context size (0 - all)
-wt N, --word-thold N [0.01 ] word timestamp probability threshold
-et N, --entropy-thold N [2.40 ] entropy threshold for decoder fail
-lpt N, --logprob-thold N [-1.00 ] log probability threshold for decoder fail
-nth N, --no-speech-thold N [0.60 ] no speech threshold
-tp, --temperature N [0.00 ] The sampling temperature, between 0 and 1
-tpi, --temperature-inc N [0.20 ] The increment of temperature, between 0 and 1
-debug, --debug-mode [false ] enable debug mode (eg. dump log_mel)
-tr, --translate [false ] translate from source language to english
-di, --diarize [false ] stereo audio diarization
-tdrz, --tinydiarize [false ] enable tinydiarize (requires a tdrz model)
-nf, --no-fallback [false ] do not use temperature fallback while decoding
-otxt, --output-txt [false ] output result in a text file
-ovtt, --output-vtt [false ] output result in a vtt file
-osrt, --output-srt [false ] output result in a srt file
-olrc, --output-lrc [false ] output result in a lrc file
-owts, --output-words [false ] output script for generating karaoke video
-fp, --font-path [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
-ocsv, --output-csv [false ] output result in a CSV file
-oj, --output-json [false ] output result in a JSON file
-ojf, --output-json-full [false ] include more information in the JSON file
-of FNAME, --output-file FNAME [ ] output file path (without file extension)
-np, --no-prints [false ] do not print anything other than the results
-ps, --print-special [false ] print special tokens
-pc, --print-colors [false ] print colors
-pp, --print-progress [false ] print progress
-nt, --no-timestamps [false ] do not print timestamps
-l LANG, --language LANG [en ] spoken language ('auto' for auto-detect)
-dl, --detect-language [false ] exit after automatically detecting language
--prompt PROMPT [ ] initial prompt (max n_text_ctx/2 tokens)
-m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-f FNAME, --file FNAME [ ] input audio file path
-oved D, --ov-e-device DNAME [CPU ] the OpenVINO device used for encode inference
-dtw MODEL --dtw MODEL [ ] compute token-level timestamps
-ls, --log-score [false ] log best decoder scores of tokens
-ng, --no-gpu [false ] disable GPU
-fa, --flash-attn [false ] flash attention
-sns, --suppress-nst [false ] suppress non-speech tokens
--suppress-regex REGEX [ ] regular expression matching tokens to suppress
--grammar GRAMMAR [ ] GBNF grammar to guide decoding
--grammar-rule RULE [ ] top-level GBNF grammar rule name
--grammar-penalty N [100.0 ] scales down logits of nongrammar tokens
 