python audio webrtc

WebRTC: getting VAD data on WAV audio in Python


I am trying to run the WebRTC VAD example code found here.

But when I feed it a mono, 16-bit WAV file of just me speaking with very long pauses, it marks almost the entire file as voiced, and the voiced output, chunk-00.wav, contains the whole audio file.

Any help is greatly appreciated. The console output I get is shown below.

(base) gulag_dweller@Tumuls-MacBook-Pro python_transformers % python3 VAD-python.py /Users/gulag_dweller/Downloads/try_voice.wav 
sample rate is: 48000 Hz
00001111111111+(0.12)11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111011111111111111111111111111111111111111111111111111111111111111111011111111111111111111111111111111111111111111100011111111111111111111111111111111111111110111111111111111111111111111111111111111110001111111111111111111111111111111111111111111111111-(16.22999999999986)
 Writing chunk-00.wav

Solution

  • I think I have found an alternative way to get VAD data. Instead of relying on the predefined method from the example linked above, I wrote my own function.

    The function looks at the absolute amplitude of the waveform. It first scans the samples to estimate the base noise level, stopping at the first sharp spike (a sample more than 1.6x the largest amplitude seen so far). Every sample whose absolute amplitude exceeds that noise level is then marked as voice activity. This assumes that only one person is speaking and that the noise level stays roughly constant.

    import numpy as np

    # Assumes audio_data1 holds the samples and time_s the matching timestamps in seconds
    y_vad = []        # per-sample VAD flags (1 = voiced, 0 = unvoiced)
    max_noise = -1.0  # start below any possible absolute amplitude
    
    for i in range(len(time_s)):
        t = time_s[i]
    
        # absolute amplitude of the current sample
        current_audio_amplitude = np.abs(audio_data1[i])
        # Ignore spikes in the first 0.2 s, where start-up artefacts appear. After that,
        # stop scanning at the first sample more than 60% above the running maximum
        if t > 0.2 and max_noise > 0 and current_audio_amplitude > 1.6 * max_noise:
            print(t, current_audio_amplitude, max_noise)
            break
        # otherwise keep the highest amplitude seen so far as max_noise
        if current_audio_amplitude > max_noise:
            max_noise = current_audio_amplitude
    print('max-noise is: ' + str(max_noise))
    
    for i in range(len(time_s)):
        # any sample whose absolute amplitude exceeds max_noise counts as voice activity
        if np.abs(audio_data1[i]) > max_noise:
            y_vad.append(1)
        # otherwise the VAD value is 0
        else:
            y_vad.append(0)
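
The approach above can also be sketched as a small self-contained script. This is only an illustration under stated assumptions: the function names `estimate_noise_ceiling` and `amplitude_vad`, the parameters `settle_s` and `jump`, and the synthetic test signal (one second of low-level noise followed by a 200 Hz tone) are made up for the example. They are not part of the original code or of the WebRTC VAD library.

```python
import numpy as np


def estimate_noise_ceiling(samples, sample_rate, settle_s=0.2, jump=1.6):
    """Track the running max of |amplitude| until a sample exceeds jump x that max.

    Spikes during the first settle_s seconds are ignored (hypothetical parameter names).
    """
    amplitude = np.abs(np.asarray(samples, dtype=np.float64))
    max_noise = -1.0
    for i, a in enumerate(amplitude):
        t = i / sample_rate
        # A sharp spike after the settling period ends the noise estimate
        if t > settle_s and max_noise > 0 and a > jump * max_noise:
            break
        if a > max_noise:
            max_noise = a
    return max_noise


def amplitude_vad(samples, max_noise):
    """Return 1 for every sample whose |amplitude| exceeds max_noise, else 0."""
    amplitude = np.abs(np.asarray(samples, dtype=np.float64))
    return (amplitude > max_noise).astype(np.int8)


# Synthetic signal (an assumption for the demo): 1 s of noise, then 0.5 s of tone
sample_rate = 8000
rng = np.random.default_rng(0)
noise = rng.uniform(-0.01, 0.01, sample_rate)
t_tone = np.arange(sample_rate // 2) / sample_rate
tone = 0.5 * np.sin(2 * np.pi * 200 * t_tone)
signal = np.concatenate([noise, tone])

max_noise = estimate_noise_ceiling(signal, sample_rate)
vad = amplitude_vad(signal, max_noise)

print("max-noise is:", max_noise)
print("voiced fraction in noise part:", vad[:sample_rate].mean())
print("voiced fraction in tone part:", vad[sample_rate:].mean())
```

Note that a per-sample flag drops to 0 wherever the waveform crosses zero, even in the middle of speech, so in practice it may be worth aggregating the flags over short frames before cutting the audio into chunks.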