Tags: java, speech-recognition, sphinx4, voice-detection

Voice Activity Detection (VAD/SAR) with LIUM


I wrote a shell script that trains several GMMs for different kinds of voice activity and for silence, using the LIUM speaker diarization toolkit. I want to use these models for voice activity detection. The script below extracts MFCC features from a WAV audio file with Sphinx4, trains GMMs on them, and applies Viterbi decoding for segmentation. However, the results are very poor: the resulting segmentation is completely wrong. That should not happen, since I am applying the GMMs to the training set itself. What am I doing wrong? I have put a lot of effort into this and still cannot get it working. Thanks a lot in advance for any help!

BTW: I double-checked the input format of my WAV file, which is mono, 16-bit, little-endian, as the Sphinx4 documentation requires. I also tried many different parameter settings, especially emCtrl (training of the GMMs) and dPenality (Viterbi decoding for segmentation). Nothing helped.
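For reference, a format check like the one described above could be sketched as follows. This is only an illustration: the function name `wav_info` is hypothetical, and it assumes a canonical 44-byte PCM WAV header and a little-endian host (because `od` decodes multi-byte integers in host byte order). Files with extra header chunks would need a real parser.

```shell
# Print "channels sample_rate bits_per_sample" for a WAV file.
# Assumption: canonical 44-byte PCM header, little-endian host.
wav_info() {
    local f=$1
    local channels rate bits
    # Header offsets: channels at byte 22, sample rate at 24, bit depth at 34.
    channels=$(od -An -tu2 -j22 -N2 "$f" | tr -d ' \n')
    rate=$(od -An -tu4 -j24 -N4 "$f" | tr -d ' \n')
    bits=$(od -An -tu2 -j34 -N2 "$f" | tr -d ' \n')
    echo "$channels $rate $bits"
}
```

Calling `wav_info "$wav"` on a mono, 16 kHz, 16-bit file with such a header should print `1 16000 16`, which matches what the `audio16kHz2sphinx` descriptor suggests the toolkit expects.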

Here is my shell script:

#!/bin/bash

wav=$1
base=`basename $wav .wav`
show=$base
fDescIn="audio16kHz2sphinx,1:1:0:0:0:0,13,0:0:0"
fDescOut="sphinx,1:1:0:0:0:0,13,0:0:0"
features="./%s.mfcc"
seg="./%s.seg"
gmmInit="./%s.init.gmms" # output GMM, %s is replaced by $show
gmm="./%s.gmms"

# Extract MFCC features
java -Xmx2048m -classpath lium.jar \
fr.lium.spkDiarization.tools.Wave2FeatureSet \
--fInputMask=$wav --sInputMask="" --fInputDesc=$fDescIn \
--fOutputMask=$base.mfcc --fOutputDesc=$fDescOut $show

# Initialize the GMM 
java -Xmx1024m -cp lium.jar \
fr.lium.spkDiarization.programs.MTrainInit \
--sInputMask=$show".seg" --fInputMask=$base.mfcc \
--fInputDesc=$fDescOut --kind=DIAG --nbComp=16 \
--emInitMethod=split_all --emCtrl=1,5,0.05 --tOutputMask=$gmmInit $show

# Train GMMs via EM
java -Xmx1024m -cp lium.jar \
fr.lium.spkDiarization.programs.MTrainEM \
--sInputMask=$show".seg" --fInputMask=$base.mfcc --emCtrl=10,20,0.01 \
--fInputDesc=$fDescOut --tInputMask=$gmmInit --tOutputMask=$gmm $show

# Segmentation
iseg=./$datadir/$show.i.seg
pmsseg=./$datadir/$show.pms.seg
java -Xmx2048m -cp lium.jar \
fr.lium.spkDiarization.programs.MDecode \
--fInputDesc=$fDescOut --fInputMask=$base.mfcc --sInputMask=$show.out2.seg \
--sOutputMask=$show.result.seg --dPenality=1,1,1,1 --tInputMask=$gmm $show

Solution

  • Appending ":1" to the end of both fDescIn and fDescOut fixed it. This last field specifies the normalization method; 1 means cluster-wise. ":0" (segment-wise) also works and gives comparable results.

    The code examples from LIUM's official website are wrong in this respect.
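As a sketch of that fix, the two descriptor variables from the question would look like this with the extra field appended. The `norm` variable is a hypothetical helper introduced only for illustration; the meaning of its values (1 = cluster-wise, 0 = segment-wise) is taken from the answer above, not checked against the toolkit.

```shell
#!/bin/bash
# Normalization method appended as the last descriptor field.
# Per the answer: 1 = cluster-wise, 0 = segment-wise (both reported to work).
norm=1

fDescIn="audio16kHz2sphinx,1:1:0:0:0:0,13,0:0:0:${norm}"
fDescOut="sphinx,1:1:0:0:0:0,13,0:0:0:${norm}"

echo "fDescIn=$fDescIn"
echo "fDescOut=$fDescOut"
```

The rest of the script stays unchanged, since every later step already reads `$fDescIn` or `$fDescOut`.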