Tags: python, audio, webrtc, speech-recognition, librosa

Audio signal split at word level boundary


I am working with an audio file using webrtcvad and pydub. Currently the file is split into fragments at silences between sentences. Is there any way to split at word-level boundaries instead (i.e. after each spoken word)? If librosa/ffmpeg/pydub has a feature like this, is a split at each spoken word possible? After the split, I need the start and end time of each word exactly as it is positioned in the original file. One simple way to split with ffmpeg is described here:

https://gist.github.com/vadimkantorov/00bf4fbe4323360722e3d2220cc2915e

but this also splits on silence, and the result changes with each padding value or frame size. I am trying to split by spoken word. As an example, I have done this manually on the original file; the split words and their time positions (as JSON) are in the folder provided here:

www.mediafire.com/file/u4ojdjezmw4vocb/attached_problem.tar.gz


Solution

  • Simple audio segmentation problems can be handled with a Hidden Markov Model, after preprocessing the audio into suitable features. Typical features for speech would be sound level and vocal activity / voicedness. To get word-level segmentation (as opposed to sentence-level), the features need a rather high time resolution. Unfortunately py-webrtcvad does not have adjustable time smoothing, so it may not be suited for the task.

    In your audio sample there is a radio host speaking rather quickly in German. Looking at the sound levels with respect to the word boundaries you have marked, it is clear that between some words the sound level does not really drop. That rules out a simple sound-level segmentation model.
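
    One way to check this yourself is to print the frame-wise sound level around the annotated boundaries. Below is a minimal sketch, assuming librosa is installed and attached_problem/main.wav is the file from the question (the hop length and the inspected time window are just illustrative choices):

    import librosa
    import numpy

    # load with the file's native sampling rate
    audio, sr = librosa.load('attached_problem/main.wav', sr=None)

    # frame-wise RMS level, converted to decibels relative to the peak
    hop_length = 512
    rms = librosa.feature.rms(y=audio, hop_length=hop_length)[0]
    level_db = librosa.amplitude_to_db(rms, ref=numpy.max)
    times = librosa.frames_to_time(numpy.arange(len(level_db)), sr=sr, hop_length=hop_length)

    # inspect the level around one of the marked boundaries, e.g. near 1.3 seconds
    window = (times > 1.2) & (times < 1.45)
    print(numpy.round(level_db[window], 1))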

    All in all, getting good results on general speech signals can be quite hard. Fortunately this is very well researched, and off-the-shelf solutions are available. These typically use an acoustic model (how words and phonemes sound) as well as a language model (the likely order of words), learned over many hours of audio.

    Word segmentation using a speech recognition library

    All these features are included in a speech recognition framework, and many frameworks can provide word-level outputs with timing. Below is some working code for this using Vosk.

    Alternatives to Vosk would be PocketSphinx, or an online speech recognition service from Google Cloud, Amazon Web Services, Azure, etc.

    
    import sys
    import os
    import subprocess
    import json
    import math
    
    # tested with VOSK 0.3.15
    import vosk
    import librosa
    import numpy
    import pandas
    
    
    
    def extract_words(res):
        # the recognizer returns JSON; per-word timings live in the 'result' field
        jres = json.loads(res)
        if 'result' not in jres:
            return []
        words = jres['result']
        return words
    
    def transcribe_words(recognizer, audio_bytes):
        results = []

        # feed the audio to the recognizer in chunks; whenever VOSK considers
        # an utterance complete, collect the word-level results
        chunk_size = 4000
        for chunk_no in range(math.ceil(len(audio_bytes)/chunk_size)):
            start = chunk_no*chunk_size
            end = min(len(audio_bytes), (chunk_no+1)*chunk_size)
            data = audio_bytes[start:end]

            if recognizer.AcceptWaveform(data):
                words = extract_words(recognizer.Result())
                results += words
        results += extract_words(recognizer.FinalResult())

        return results
    
    def main():
    
        vosk.SetLogLevel(-1)
    
        audio_path = sys.argv[1]
        out_path = sys.argv[2]
    
        model_path = 'vosk-model-small-de-0.15'
        sample_rate = 16000
    
        audio, sr = librosa.load(audio_path, sr=sample_rate)
    
        # convert to 16bit signed PCM, as expected by VOSK
        int16 = numpy.int16(audio * 32768).tobytes()
    
        # XXX: Model must be downloaded from https://alphacephei.com/vosk/models
        # https://alphacephei.com/vosk/models/vosk-model-small-de-0.15.zip
        if not os.path.exists(model_path):
            raise ValueError(f"Could not find VOSK model at {model_path}")
    
        model = vosk.Model(model_path)
        recognizer = vosk.KaldiRecognizer(model, sample_rate)
    
        res = transcribe_words(recognizer, int16)
        df = pandas.DataFrame.from_records(res)
        df = df.sort_values('start')
    
        df.to_csv(out_path, index=False)
        print('Word segments saved to', out_path)
    
    if __name__ == '__main__':
        main()
    

    Run the program with the .WAV file and the path to an output file.

    python vosk_words.py attached_problem/main.wav out.csv
    

    The script outputs words and their times in the CSV. These timings can then be used to split the audio file. Here is example output:

    conf,end,start,word
    0.618949,1.11,0.84,also
    1.0,1.32,1.116314,eine
    1.0,1.59,1.32,woche
    0.411941,1.77,1.59,des
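
    The timings can then be used to cut the original audio into one clip per word. A minimal sketch, assuming pydub is installed, out.csv is the file produced above, and word_clips/ is just an example output directory:

    import os
    import pandas
    from pydub import AudioSegment

    words = pandas.read_csv('out.csv')
    audio = AudioSegment.from_wav('attached_problem/main.wav')

    os.makedirs('word_clips', exist_ok=True)
    for idx, row in words.iterrows():
        # pydub indexes audio in milliseconds
        clip = audio[int(row['start'] * 1000):int(row['end'] * 1000)]
        clip.export(os.path.join('word_clips', f"{idx:03d}_{row['word']}.wav"), format='wav')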
    

    Comparing the output (bottom) with the example file you provided (top), it looks pretty good.

    [image: provided word annotations (top) compared with the word segments recognized by Vosk (bottom)]

    It actually picked up a word that your annotations did not include, "und" at 42.25 seconds.