Sound file to text file: speech recognition for Ubuntu, specifically pocketsphinx usage

As made clear here: https://unix.stackexchange.com/questions/256138/is-there-any-decent-speech-recognition-software-for-linux

Finding speech recognition software that turns a sound file into text is difficult on Linux.

I am trying to use the pocketsphinx_continuous command. PocketSphinx is already installed.

I have downloaded several dictionary files, language model files and acoustic model folders, and I tried running pocketsphinx_continuous with them.

The command I use is: sudo pocketsphinx_continuous -dict /home/barnabas/Desktop/dict/cmudict.dict -hmm /home/barnabas/Desktop/wsj_all_sc.cd_semi_5000/ -lm /home/barnabas/Desktop/en-70k-0.1.lm -infile untitled2.wav 2> pocketsphinx.log > myspeech.txt

Without fail, every line of output contains only a zero-padded index on the left, with no recognized text next to it:

000000000:

Please give me a short list of one dictionary file, one language model file and one acoustic model that are compatible with each other. Thank you.

Solution

I want a short list of one dictionary file, language model file, acoustic file listed please, that are compatible with each other.

Install the pocketsphinx-en-us package from the universe/sound section. (It's available in Ubuntu 18.04 Bionic Beaver and later. Prior to that, I believe it was called pocketsphinx-hmm-en-hub4wsj.) This will put the model in /usr/share/pocketsphinx/model/en-us/.

After that, you can run commands like this (there's no need to use sudo):

pocketsphinx_continuous -infile myfile.wav 2>&1 > myspeech.txt | tee out.log | less
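The order of the redirections in that command is deliberate: 2>&1 is processed first and points stderr at the pipe, then > myspeech.txt moves only stdout into the file. The recognized text therefore lands in myspeech.txt while the decoder's log messages go through tee into out.log and on to less. The sketch below demonstrates the same plumbing with a stand-in function, so it runs without PocketSphinx installed. fake_recognizer, transcript.txt and decoder.log are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Stand-in for pocketsphinx_continuous (hypothetical):
# one line on stdout (the transcript), one on stderr (the log).
fake_recognizer() {
    echo "recognized text"
    echo "INFO: decoder log line" >&2
}

workdir=$(mktemp -d)

# 2>&1 first sends stderr into the pipe; the later > redirects only stdout.
fake_recognizer 2>&1 > "$workdir/transcript.txt" | tee "$workdir/decoder.log" > /dev/null

echo "transcript: $(cat "$workdir/transcript.txt")"
echo "log:        $(cat "$workdir/decoder.log")"
```

Writing the redirections the other way round (> file 2>&1) would send both streams into the file and leave the pipe empty.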

Or if you want to specify the folders manually:

pocketsphinx_continuous \
    -hmm /usr/share/pocketsphinx/model/en-us/en-us \
    -dict /usr/share/pocketsphinx/model/en-us/cmudict-en-us.dict \
    -lm /usr/share/pocketsphinx/model/en-us/en-us.lm.bin \
    -infile myfile.wav > myspeech.txt
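If you do pass the paths by hand, it can save time to confirm that all three exist before starting the decoder. This is only a sketch: check_model is a hypothetical helper, not part of PocketSphinx, and it checks for presence only, not for whether the three pieces are compatible with each other.

```shell
#!/bin/sh
# check_model HMM_DIR DICT_FILE LM_FILE   (hypothetical helper)
# Reports each missing path on stderr; returns non-zero if any is missing.
check_model() {
    status=0
    [ -d "$1" ] || { echo "missing acoustic model directory: $1" >&2; status=1; }
    [ -f "$2" ] || { echo "missing dictionary file: $2" >&2; status=1; }
    [ -f "$3" ] || { echo "missing language model file: $3" >&2; status=1; }
    return "$status"
}

# Paths as installed by the pocketsphinx-en-us package.
model=/usr/share/pocketsphinx/model/en-us
if check_model "$model/en-us" "$model/cmudict-en-us.dict" "$model/en-us.lm.bin"; then
    echo "model files found"
fi
```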

Make sure you have a 16-bit, 16 kHz mono wav file, or convert if necessary:

ffmpeg -i myfile.mp3 -ar 16000 -ac 1 -sample_fmt s16 myfile.wav
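To check whether a file already has that format without installing anything extra, you can read the fields straight out of the WAV header. The sketch below assumes the canonical 44-byte PCM header, with the fmt chunk directly after the RIFF header; files that carry extra chunks before fmt will be misread, and a tool such as ffprobe or soxi is more robust for those. wav_field and check_wav are hypothetical helper names.

```shell
#!/bin/sh
# wav_field FILE OFFSET NBYTES   (hypothetical helper)
# Prints the little-endian unsigned integer stored at OFFSET.
wav_field() {
    file=$1 offset=$2 count=$3
    value=0 shift_bits=0
    for byte in $(od -An -tu1 -j"$offset" -N"$count" "$file"); do
        value=$(( value + (byte << shift_bits) ))
        shift_bits=$(( shift_bits + 8 ))
    done
    echo "$value"
}

# check_wav FILE   (hypothetical helper)
# Prints the format and succeeds only for 16-bit, 16 kHz mono.
# Offsets assume the canonical header layout described above.
check_wav() {
    channels=$(wav_field "$1" 22 2)
    rate=$(wav_field "$1" 24 4)
    bits=$(wav_field "$1" 34 2)
    echo "channels=$channels rate=$rate bits=$bits"
    [ "$channels" -eq 1 ] && [ "$rate" -eq 16000 ] && [ "$bits" -eq 16 ]
}
```

Usage would look like: check_wav myfile.wav || echo "convert this file first"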

You might not get the best accuracy from the generic model. Here's set #1 of the Harvard Sentences:

One: The birch canoe slid on the smooth planks.
Two: Glue the sheet to the dark blue background.
Three: It's easy to tell the depth of a well.
Four: These days a chicken leg is a rare dish.
Five: Rice is often served in round bowls.
Six: The juice of lemons makes fine punch.
Seven: The box was thrown beside the parked truck.
Eight: The hogs were fed chopped corn and garbage.
Nine: Four hours of steady work faced us.
Ten: Large size in stockings is hard to sell.

and here's the output I got from my recording:

if one half the brcko nude lid on this good length
to conclude ishii to the dark blue background
three it's easy to tell the devil wow
for these days eat chicken leg is a rare dish
five race is often served in round polls
six the juice of the lemons makes flying conch
seven the box was thrown beside the parked truck
eight the hogs griffin chopped coroner and garbage
not in four hours of steady work the stocks
ten large son is in stockings his heart is the good
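To put a number on a comparison like this, one common measure is the word-level edit distance between what was said and what was recognized, divided by the number of reference words (the word error rate). The following is a minimal sketch using awk, not something PocketSphinx provides: word_errors is a hypothetical helper, and its normalization (lowercasing, then dropping every character other than a-z, 0-9 and space) is a simplifying assumption.

```shell
#!/bin/sh
# word_errors REFERENCE HYPOTHESIS   (hypothetical helper)
# Prints "<edits> <reference word count>", where edits is the word-level
# Levenshtein distance (substitutions + insertions + deletions).
word_errors() {
    awk -v ref="$1" -v hyp="$2" '
    function normalize(s) {
        s = tolower(s)
        gsub(/[^a-z0-9 ]/, "", s)   # drop punctuation, keep letters, digits, spaces
        return s
    }
    BEGIN {
        n = split(normalize(ref), r, " ")
        m = split(normalize(hyp), h, " ")
        for (i = 0; i <= n; i++) d[i, 0] = i
        for (j = 0; j <= m; j++) d[0, j] = j
        for (i = 1; i <= n; i++) {
            for (j = 1; j <= m; j++) {
                cost = (r[i] == h[j]) ? 0 : 1
                best = d[i-1, j-1] + cost                         # match or substitution
                if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1    # deletion
                if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1    # insertion
                d[i, j] = best
            }
        }
        print d[n, m], n
    }'
}

word_errors "Five: Rice is often served in round bowls." \
            "five race is often served in round polls"
# prints: 2 8
```

Here two of the eight reference words were misrecognized, while sentence seven from the same recording comes out with zero edits.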