speech-recognition, speech-to-text, cmusphinx

How to combine speech recognition and speaker diarization?


I am trying to combine speech recognition and speaker diarization to identify how many speakers are present in a conversation and which speaker said what.

For this I am using CMU Sphinx and LIUM Speaker Diarization.

I am able to run these two tools separately, i.e. I can run Sphinx 4 and get text output from the audio, and I can run the LIUM toolkit and get audio segments.

Now I want to combine the two and get output like this:

s0 : this is my first sentence.
s1 : this is my reply.
s2 : i do not know what you are talking about

Does anyone know how to combine these two toolkits?


Solution

  • Run the diarization tool to get segment times for each speaker. The output looks like this:

    file1 1 16105 217 M S U S9_file1
    file1 1 16322 1908 M S U S9_file1
    file2 1 18232 603 M S U S9_file2
    

    The third and fourth fields (here 16105 and 217) are the segment start and the segment length; LIUM counts them in feature frames, typically 10 ms each. The last field is the speaker label. Parse this text output and store the times in an array.

    Then split the original audio into segments using those times.

    Process each segment separately with Sphinx4 and print its transcription prefixed with the segment's speaker label.

    Optionally, run speaker adaptation on the segments of each speaker and process each segment again with the speaker-adapted model.
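
The parsing and splitting steps above can be sketched in Python using only the standard library. This is a minimal sketch, not part of either toolkit. It assumes the segmentation file has eight whitespace-separated fields per line, that start and length are counted in 10 ms frames, and that lines starting with `;;` are comments. The names `Segment`, `parse_seg`, `cut_wav` and `label_transcript` are hypothetical helpers, and `transcribe` stands in for however you invoke Sphinx4 on a cut segment.

```python
import wave
from typing import Callable, List, NamedTuple

FRAMES_PER_SECOND = 100  # assumption: one diarization frame is 10 ms


class Segment(NamedTuple):
    show: str        # first field, e.g. "file1"
    start: float     # segment start in seconds
    duration: float  # segment length in seconds
    speaker: str     # last field, e.g. "S9_file1"


def parse_seg(text: str) -> List[Segment]:
    """Parse segmentation lines and return them sorted by show and start time."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;"):  # skip blank and comment lines
            continue
        fields = line.split()
        start_frames, length_frames = int(fields[2]), int(fields[3])
        segments.append(Segment(
            show=fields[0],
            start=start_frames / FRAMES_PER_SECOND,
            duration=length_frames / FRAMES_PER_SECOND,
            speaker=fields[7],
        ))
    return sorted(segments, key=lambda s: (s.show, s.start))


def cut_wav(src, dst, start: float, duration: float) -> None:
    """Copy `duration` seconds of WAV `src`, beginning at `start`, into WAV `dst`.

    `src` and `dst` may be file names or seekable binary file objects.
    """
    with wave.open(src, "rb") as reader:
        rate = reader.getframerate()
        first = min(round(start * rate), reader.getnframes())
        reader.setpos(first)
        frames = reader.readframes(round(duration * rate))
        with wave.open(dst, "wb") as writer:
            writer.setparams(reader.getparams())  # same channels, width, rate
            writer.writeframes(frames)            # header frame count is patched


def label_transcript(segments: List[Segment],
                     transcribe: Callable[[Segment], str]) -> List[str]:
    """Build 'speaker : text' lines; `transcribe` is supplied by the caller."""
    return [f"{seg.speaker} : {transcribe(seg)}" for seg in segments]
```

A typical use would be to call `parse_seg` on the diarization output, call `cut_wav` once per segment to write files such as `file1_0.wav`, run the recognizer on each file, and pass a function that returns the recognized text to `label_transcript`. Sorting by start time keeps the printed lines in conversation order.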