
Sound file to text file: speech recognition for Ubuntu, specifically pocketsphinx usage


As made clear here: https://unix.stackexchange.com/questions/256138/is-there-any-decent-speech-recognition-software-for-linux

Finding speech recognition software that turns a sound file into text is difficult on Linux.

I am trying to use the pocketsphinx_continuous command. PocketSphinx is already installed.

I have downloaded several dictionary files, language model files, and acoustic model folders, and tried running pocketsphinx_continuous with them.

The command I use is:

    sudo pocketsphinx_continuous -dict /home/barnabas/Desktop/dict/cmudict.dict -hmm /home/barnabas/Desktop/wsj_all_sc.cd_semi_5000/ -lm /home/barnabas/Desktop/en-70k-0.1.lm -infile untitled2.wav 2> pocketsphinx.log > myspeech.txt

Without fail, every line of output contains only a zero-padded index on the left, with no recognized text next to it:

000000000:

Please give me a short list of one dictionary file, one language model file, and one acoustic model that are compatible with each other. Thank you.


Solution

  • Please give me a short list of one dictionary file, one language model file, and one acoustic model that are compatible with each other.

    Install the pocketsphinx-en-us package from the universe/sound section. (It's available in Ubuntu 18.04 Bionic Beaver and later. Prior to that, I believe it was called pocketsphinx-hmm-en-hub4wsj.) This will put the model in /usr/share/pocketsphinx/model/en-us/.

    After that, you can run commands like this (there's no need to use sudo):

    pocketsphinx_continuous -infile myfile.wav 2>&1 > myspeech.txt | tee out.log | less
    

    Or if you want to specify the folders manually:

    pocketsphinx_continuous \
        -hmm /usr/share/pocketsphinx/model/en-us/en-us \
        -dict /usr/share/pocketsphinx/model/en-us/cmudict-en-us.dict \
        -lm /usr/share/pocketsphinx/model/en-us/en-us.lm.bin \
        -infile myfile.wav > myspeech.txt
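
    Before running the manual form, it can help to confirm that all three pieces are actually present in one directory. The following is a minimal sketch, assuming the file names shown above; check_model is a hypothetical helper name, not part of pocketsphinx.

```shell
#!/bin/sh
# check_model DIR: report whether the acoustic model folder, dictionary and
# language model are all present under DIR. The names below follow the
# en-us layout described in this answer; adjust them for other models.
check_model() {
    dir=$1
    status=0
    [ -d "$dir/en-us" ] || { echo "missing acoustic model: $dir/en-us" >&2; status=1; }
    [ -f "$dir/cmudict-en-us.dict" ] || { echo "missing dictionary: $dir/cmudict-en-us.dict" >&2; status=1; }
    [ -f "$dir/en-us.lm.bin" ] || { echo "missing language model: $dir/en-us.lm.bin" >&2; status=1; }
    return "$status"
}
```

    For example, check_model /usr/share/pocketsphinx/model/en-us prints nothing and returns 0 when the package is installed as described, and names each missing piece otherwise.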
    

    Make sure you have a 16-bit, 16 kHz mono wav file, or convert if necessary:

    ffmpeg -i myfile.mp3 -ar 16000 -ac 1 -sample_fmt s16 myfile.wav
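
    To check whether a file already has the right format, the fields can be read straight from the WAV header. This is only a sketch: it assumes a canonical 44-byte PCM header (the fmt chunk directly after the RIFF header) and a little-endian machine, and wav_info and wav_ok are hypothetical helper names.

```shell
#!/bin/sh
# wav_info FILE: print "channels sample_rate bits_per_sample".
# Assumes a canonical PCM WAV header: channels at byte 22, sample rate at
# byte 24, bits per sample at byte 34. Files with extra chunks before the
# fmt chunk will give wrong values.
wav_info() {
    f=$1
    # od prints integers in host byte order; WAV fields are little-endian,
    # so this relies on a little-endian host (x86, most ARM systems).
    channels=$(od -An -tu2 -j22 -N2 "$f" | tr -d ' \n')
    rate=$(od -An -tu4 -j24 -N4 "$f" | tr -d ' \n')
    bits=$(od -An -tu2 -j34 -N2 "$f" | tr -d ' \n')
    echo "$channels $rate $bits"
}

# wav_ok FILE: succeed only for mono, 16 kHz, 16-bit input.
wav_ok() {
    [ "$(wav_info "$1")" = "1 16000 16" ]
}
```

    A file for which wav_ok fails is a candidate for the ffmpeg conversion above.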
    

    You might not get the best accuracy from the generic model. Here's set #1 of the Harvard Sentences:

    One: The birch canoe slid on the smooth planks.
    Two: Glue the sheet to the dark blue background.
    Three: It's easy to tell the depth of a well.
    Four: These days a chicken leg is a rare dish.
    Five: Rice is often served in round bowls.
    Six: The juice of lemons makes fine punch.
    Seven: The box was thrown beside the parked truck.
    Eight: The hogs were fed chopped corn and garbage.
    Nine: Four hours of steady work faced us.
    Ten: Large size in stockings is hard to sell.
    

    and here's the output I got from my recording:

    if one half the brcko nude lid on this good length
    to conclude ishii to the dark blue background
    three it's easy to tell the devil wow
    for these days eat chicken leg is a rare dish
    five race is often served in round polls
    six the juice of the lemons makes flying conch
    seven the box was thrown beside the parked truck
    eight the hogs griffin chopped coroner and garbage
    not in four hours of steady work the stocks
    ten large son is in stockings his heart is the good
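
    One way to put a number on this is the word error rate (WER): the word-level edit distance between the reference text and the recognizer output, divided by the number of reference words. The sketch below is an illustration rather than a standard scoring tool; wer and normalize are hypothetical helper names, and the two arguments are assumed to be plain text files holding the reference and the recognized text.

```shell
#!/bin/sh
# normalize FILE: lower-case the text and turn everything except letters,
# digits and apostrophes (including newlines) into spaces.
normalize() {
    tr '[:upper:]' '[:lower:]' < "$1" | tr -c "[:alnum:]'" ' '
}

# wer REF_FILE HYP_FILE: print "errors reference_words wer_percent".
wer() {
    ref=$(normalize "$1")
    hyp=$(normalize "$2")
    awk -v ref="$ref" -v hyp="$hyp" 'BEGIN {
        n = split(ref, r, " ")
        m = split(hyp, h, " ")
        if (n == 0) { print "empty reference"; exit 1 }
        # Word-level Levenshtein distance by dynamic programming.
        for (i = 0; i <= n; i++) d[i, 0] = i
        for (j = 0; j <= m; j++) d[0, j] = j
        for (i = 1; i <= n; i++) {
            for (j = 1; j <= m; j++) {
                # Append "" so words are compared as strings, not numbers.
                cost = ((r[i] "") == (h[j] "")) ? 0 : 1
                best = d[i-1, j-1] + cost                        # match or substitution
                if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1   # deletion
                if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1   # insertion
                d[i, j] = best
            }
        }
        printf "%d %d %.1f\n", d[n, m], n, 100 * d[n, m] / n
    }'
}
```

    Saving the reference sentences and the recognizer output to two files and running wer on them gives a single figure that can be compared across models or recordings. Note that the "One:", "Two:" prefixes count as words too.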
    
