Tags: c++, c, linux, audio, speech

How to separate an audio file based on different speakers


I have a bunch of audio files of telephone conversations. I want to split each audio file into two files, each containing only one speaker's speech. Maybe I need to use speaker diarization, but how can I do that? Can anybody give me some clues? Thank you. PS: Linux OS, C/C++.


Solution

  • While separating the individual speakers is quite a difficult problem, you can automatically split the audio where there are pauses. This would produce a series of files that would likely be easier to manage, since speakers often alternate between pauses.

    This approach requires the open source Julius speech recognition decoder package. This is available in many Linux package repositories. I use the Ubuntu multiverse repository.

    Here is the site: http://julius.sourceforge.jp/en_index.php


    Step 0: Install Julius

    sudo apt-get install julius
    

    Step 1: Segment Audio

    adintool -in file -out file -filename myRecording.wav -startid 0 -freq 44100 -lv 2048 -zc 30 -headmargin 600 -tailmargin 600
    
    • -startid is the starting segment number that will be appended to the filename

    • -freq is the sample rate of the source audio file

    • -lv is the audio level threshold above which voice detection becomes active

    • -zc is the number of zero crossings above which voice detection becomes active

    • -headmargin and -tailmargin are the amounts of silence kept before and after each audio segment

    Note that -lv and -zc will have to be adjusted for your particular audio recording's attributes, while -headmargin and -tailmargin will have to be adjusted for your particular speakers' styles. The values given above have worked well for my voice recordings in the past.

    Here is the documentation: http://julius.sourceforge.jp/juliusbook/en/adintool.html
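    To build intuition for what -lv and -zc control, here is a minimal C sketch of the kind of frame-level voice-activity test they parameterize. This is an illustration of the idea only, not Julius's actual implementation; `frame_is_voice` is a hypothetical helper.

```c
#include <stdlib.h>

/* Returns 1 if a frame of 16-bit samples looks like speech: its peak
 * amplitude exceeds `lv` AND its zero-crossing count exceeds `zc`
 * (the idea behind adintool's -lv and -zc thresholds). */
int frame_is_voice(const short *samples, int n, int lv, int zc)
{
    int peak = 0, crossings = 0;
    for (int i = 0; i < n; i++) {
        int a = abs(samples[i]);
        if (a > peak)
            peak = a;
        if (i > 0 && ((samples[i - 1] >= 0) != (samples[i] >= 0)))
            crossings++;
    }
    return peak > lv && crossings > zc;
}
```

    With the values from the command above (lv=2048, zc=30), a near-silent frame fails the level test while a loud, rapidly oscillating frame passes both, which is why both thresholds need tuning to your recording.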


    In my experience, preprocessing the audio with compression and normalization gives better results and requires less adjustment of the Julius arguments. These initial steps are recommended but not required.

    This approach requires the open source SoX audio toolkit package. This is also available in many Linux package repositories. I use the Ubuntu universe repository.

    Here is the site: http://sox.sourceforge.net


    Step -2: Install SoX

    sudo apt-get install sox
    

    Step -1: Preprocess Audio

    sox myOriginalRecording.wav myRecording.wav gain -b -n -8 compand 0.2,0.6 4:-48,-32,-24 0 -64 0.2 gain -b -n -2
    
    • gain -b -n balances and normalizes the audio to a given level

    • compand compresses (in this case) the audio based on the parameters

    Note that compand's parameters may take some time to fully understand, but the values given above have worked well for my voice recordings in the past.

    Here is the documentation: http://sox.sourceforge.net/sox.html
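    The heart of compand is its transfer function, which maps input levels (in dB) to output levels along a piecewise-linear curve. A small C sketch of that mapping idea follows; `compand_db` is a hypothetical helper, the knee points come from one plausible reading of the command above, and real SoX compand additionally applies a soft knee, attack/decay envelope smoothing, gain, and delay.

```c
#include <stddef.h>

/* Piecewise-linear level mapping in dB, the core idea behind SoX's
 * compand transfer function. `in` and `out` are matching arrays of
 * knee points, sorted by input level. Outside the listed points the
 * mapping continues 1:1 (a simplifying assumption). */
double compand_db(double level_db, const double *in, const double *out, size_t n)
{
    if (level_db <= in[0])
        return out[0] + (level_db - in[0]); /* 1:1 below the first point */
    for (size_t i = 1; i < n; i++) {
        if (level_db <= in[i]) {
            /* Linear interpolation between adjacent knee points. */
            double t = (level_db - in[i - 1]) / (in[i] - in[i - 1]);
            return out[i - 1] + t * (out[i] - out[i - 1]);
        }
    }
    return out[n - 1] + (level_db - in[n - 1]); /* 1:1 above the last point */
}
```

    Reading the transfer list as the points (-48 dB in, -32 dB out) and (-24 dB in, -24 dB out), a -36 dB input lands at -28 dB (quiet passages are lifted toward the detection threshold), while levels above -24 dB pass through unchanged. That lifting of quiet speech is what makes the -lv setting in Step 1 less sensitive to recording level.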


    While this will not identify each speaker, it will greatly simplify the task of doing so by ear, which may end up being your only option for a while. But I do hope you find a practical solution if one is already available.