I am trying to run the WebRTC VAD example code
found here.
But when I feed it a mono 16-bit WAV file of just me speaking with very long pauses, it detects the entire file as voiced, and the voiced output chunk-00.wav
is the entire audio file.
Any help is greatly appreciated. Below is the console output I receive.
(base) gulag_dweller@Tumuls-MacBook-Pro python_transformers % python3 VAD-python.py /Users/gulag_dweller/Downloads/try_voice.wav
sample rate is: 48000 Hz
00001111111111+(0.12)11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111011111111111111111111111111111111111111111111111111111111111111111011111111111111111111111111111111111111111111100011111111111111111111111111111111111111110111111111111111111111111111111111111111110001111111111111111111111111111111111111111111111111-(16.22999999999986)
Writing chunk-00.wav
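Before abandoning WebRTC VAD entirely, two things may be worth checking: the VAD has an aggressiveness mode from 0 (least aggressive, tends to mark almost everything as speech) to 3 (most aggressive), settable via `webrtcvad.Vad(3)`, and it only accepts 10, 20, or 30 ms frames of 16-bit mono PCM. Below is a minimal sketch of the frame math at the 48 kHz rate shown in the console output above; it uses synthetic silence instead of a real file, so `webrtcvad` itself is not needed to run it.

```python
# Frame-size sanity check for WebRTC VAD input at 48 kHz.
# webrtcvad's is_speech() only accepts 10/20/30 ms frames of 16-bit mono
# PCM; a wrong frame length is a common cause of odd results.

SAMPLE_RATE = 48000      # Hz, as reported in the console output above
FRAME_MS = 30            # webrtcvad accepts 10, 20, or 30 ms frames
BYTES_PER_SAMPLE = 2     # 16-bit PCM

samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000       # 1440 samples
bytes_per_frame = samples_per_frame * BYTES_PER_SAMPLE   # 2880 bytes

# One second of synthetic silence standing in for real audio bytes.
pcm = b"\x00\x00" * SAMPLE_RATE

# Split into whole frames only; webrtcvad rejects partial frames.
frames = [pcm[i:i + bytes_per_frame]
          for i in range(0, len(pcm) - bytes_per_frame + 1, bytes_per_frame)]

print(samples_per_frame, bytes_per_frame, len(frames))
```

With each frame sized this way, the bytes can be passed one frame at a time to `vad.is_speech(frame, SAMPLE_RATE)`.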
I think I have found an alternative way to get VAD data. Instead of using the pre-defined method shown in the link above, I wrote my own function.
The function measures the amplitude of the waveform, and any sharp spike above the base noise level (more than 1.6x the baseline, i.e. a jump of over 60%) is taken to indicate voice activity. It assumes that only one person is speaking and that the noise level remains relatively constant.
import numpy as np  # audio_data1 (samples) and time_s (timestamps) are loaded earlier

y_list = list(audio_data1)  # snapshot of the amplitude values
y_vad = []                  # per-sample VAD flags
max_noise = -1.0            # start below any possible amplitude

# First pass: estimate the noise floor from the start of the file.
for i in range(len(time_s)):
    t = time_s[i]
    # Current absolute amplitude for index i
    current_audio_amplitude = np.abs(audio_data1[i])
    # The first 0.2 s are skipped to avoid start-up transients; after that,
    # any jump of more than 60% above the noise floor ends the first pass.
    if t > 0.2 and max_noise > 0 and current_audio_amplitude > 1.6 * max_noise:
        print(t, current_audio_amplitude, max_noise)
        break
    # Track the largest amplitude seen so far as the noise floor.
    if current_audio_amplitude > max_noise:
        max_noise = current_audio_amplitude

print('max-noise is: ' + str(max_noise))

# Second pass: any amplitude above the noise floor counts as voice activity.
for i in range(len(time_s)):
    if np.abs(audio_data1[i]) > max_noise:
        y_vad.append(1)
    else:
        y_vad.append(0)
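The same idea can be packaged as a self-contained function and checked on a synthetic signal. This is only a sketch of the approach above; `amplitude_vad`, `warmup_s`, and `spike_ratio` are illustrative names of my choosing, not part of the original code.

```python
import numpy as np

def amplitude_vad(samples, sample_rate, warmup_s=0.2, spike_ratio=1.6):
    """Threshold-based VAD sketch: estimate a noise floor from the leading
    samples, then flag anything above it as voiced (1) or unvoiced (0)."""
    amps = np.abs(np.asarray(samples, dtype=float))
    times = np.arange(len(amps)) / sample_rate

    # First pass: grow the noise floor until a sharp spike
    # (> spike_ratio x the floor) appears after the warm-up period.
    max_noise = -1.0
    for t, a in zip(times, amps):
        if t > warmup_s and max_noise > 0 and a > spike_ratio * max_noise:
            break
        max_noise = max(max_noise, a)

    # Second pass: per-sample decision against the noise floor.
    return (amps > max_noise).astype(int)

# Synthetic check: quiet noise with a loud burst in the middle.
rng = np.random.default_rng(0)
sig = rng.uniform(-0.01, 0.01, 8000)  # hypothetical background "silence"
sig[4000:4400] = 0.5                  # hypothetical speech burst
vad = amplitude_vad(sig, sample_rate=8000)
```

The quiet leading section yields no voiced flags, while the burst is flagged sample by sample. The caveats above still apply: a single speaker, and a noise level that stays roughly constant after the warm-up window.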