Tags: python, audio, webrtc, speech-recognition, librosa

Audio signal split at word level boundary


I am working with an audio file using webrtcvad and pydub. Currently the file is split into fragments at silences between sentences. Is there any way to split at word-level boundaries instead (i.e. after each spoken word)? If librosa/ffmpeg/pydub has a feature like this, is a split at each spoken word possible? After the split, I need the start and end time of each word exactly as it is positioned in the original file. One simple way to split with ffmpeg is described here:

https://gist.github.com/vadimkantorov/00bf4fbe4323360722e3d2220cc2915e

but this also splits on silence, and the result changes with each padding value or frame size. I am trying to split by spoken word. As an example, I have done this manually on the original file; the split words and their time positions (as JSON) are in the folder provided here:

www.mediafire.com/file/u4ojdjezmw4vocb/attached_problem.tar.gz


Solution

  • Simple audio segmentation problems can be handled with a Hidden Markov Model, after preprocessing the audio into suitable features. Typical features for speech would be sound level and vocal activity / voicedness. To get word-level segmentation (as opposed to sentence-level), the features need a rather high time resolution. Unfortunately py-webrtcvad does not have adjustable time smoothing, so it may not be suited for the task.

    In your audio sample there is a radio host speaking rather quickly in German. Looking at the sound levels with respect to the word boundaries you have marked, it is clear that between some words the sound level does not really drop. That rules out a simple sound-level segmentation model.
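
    One way to check this yourself is to print the frame-wise sound level around the annotated boundaries. Below is a minimal sketch, assuming librosa is installed and attached_problem/main.wav is the file from the question (the hop length and the inspected time window are just illustrative choices):

    import librosa
    import numpy

    # load with the file's native sampling rate
    audio, sr = librosa.load('attached_problem/main.wav', sr=None)

    # frame-wise RMS level, converted to decibels relative to the peak
    hop_length = 512
    rms = librosa.feature.rms(y=audio, hop_length=hop_length)[0]
    level_db = librosa.amplitude_to_db(rms, ref=numpy.max)
    times = librosa.frames_to_time(numpy.arange(len(level_db)), sr=sr, hop_length=hop_length)

    # inspect the level around one of the marked boundaries, e.g. near 1.3 seconds
    window = (times > 1.2) & (times < 1.45)
    print(numpy.round(level_db[window], 1))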

    All in all, getting good results on general speech signals can be quite hard. Fortunately this is very well researched, and off-the-shelf solutions are available. These typically use an acoustic model (how words and phonemes sound) as well as a language model (the likely order of words), learned over many hours of audio.

    Word segmentation using a speech recognition library

    All these features are included in a speech recognition framework, and many frameworks can provide word-level outputs with timing. Below is some working code for this using Vosk.

    Alternatives to Vosk would be PocketSphinx, or an online speech recognition service from Google Cloud, Amazon Web Services, Azure, etc.

    
    import sys
    import os
    import subprocess
    import json
    import math
    
    # tested with VOSK 0.3.15
    import vosk
    import librosa
    import numpy
    import pandas
    
    
    
    def extract_words(res):
        # the recognizer returns JSON; per-word timings live in the 'result' field
        jres = json.loads(res)
        if 'result' not in jres:
            return []
        words = jres['result']
        return words
    
    def transcribe_words(recognizer, audio_bytes):
        results = []

        # feed the audio to the recognizer in chunks; whenever VOSK considers
        # an utterance complete, collect the word-level results
        chunk_size = 4000
        for chunk_no in range(math.ceil(len(audio_bytes)/chunk_size)):
            start = chunk_no*chunk_size
            end = min(len(audio_bytes), (chunk_no+1)*chunk_size)
            data = audio_bytes[start:end]

            if recognizer.AcceptWaveform(data):
                words = extract_words(recognizer.Result())
                results += words
        results += extract_words(recognizer.FinalResult())

        return results
    
    def main():
    
        vosk.SetLogLevel(-1)
    
        audio_path = sys.argv[1]
        out_path = sys.argv[2]
    
        model_path = 'vosk-model-small-de-0.15'
        sample_rate = 16000
    
        audio, sr = librosa.load(audio_path, sr=sample_rate)
    
        # convert to 16bit signed PCM, as expected by VOSK
        int16 = numpy.int16(audio * 32768).tobytes()
    
        # XXX: Model must be downloaded from https://alphacephei.com/vosk/models
        # https://alphacephei.com/vosk/models/vosk-model-small-de-0.15.zip
        if not os.path.exists(model_path):
            raise ValueError(f"Could not find VOSK model at {model_path}")
    
        model = vosk.Model(model_path)
        recognizer = vosk.KaldiRecognizer(model, sample_rate)
    
        res = transcribe_words(recognizer, int16)
        df = pandas.DataFrame.from_records(res)
        df = df.sort_values('start')
    
        df.to_csv(out_path, index=False)
        print('Word segments saved to', out_path)
    
    if __name__ == '__main__':
        main()
    

    Run the program with the .WAV file and the path to an output file.

    python vosk_words.py attached_problem/main.wav out.csv
    

    The script outputs words and their times in the CSV. These timings can then be used to split the audio file. Here is example output:

    conf,end,start,word
    0.618949,1.11,0.84,also
    1.0,1.32,1.116314,eine
    1.0,1.59,1.32,woche
    0.411941,1.77,1.59,des
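
    The timings can then be used to cut the original audio into one clip per word. A minimal sketch, assuming pydub is installed, out.csv is the file produced above, and word_clips/ is just an example output directory:

    import os
    import pandas
    from pydub import AudioSegment

    words = pandas.read_csv('out.csv')
    audio = AudioSegment.from_wav('attached_problem/main.wav')

    os.makedirs('word_clips', exist_ok=True)
    for idx, row in words.iterrows():
        # pydub indexes audio in milliseconds
        clip = audio[int(row['start'] * 1000):int(row['end'] * 1000)]
        clip.export(os.path.join('word_clips', f"{idx:03d}_{row['word']}.wav"), format='wav')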
    

    Comparing the output (bottom) with the example file you provided (top), it looks pretty good.

    [image: provided word annotations (top) compared with the word segments recognized by Vosk (bottom)]

    It actually picked up a word that your annotations did not include, "und" at 42.25 seconds.