speech-recognition ibm-watson azure-cognitive-services google-speech-api dialogflow-es

How to identify multiple speakers and their text from an audio input?


I am using Microsoft's Cognitive Services. I have an audio input and need to identify multiple speakers and the text each of them spoke.

As I understand it, the Speaker Recognition API can identify different individuals and the Bing Speech API can convert speech to text. However, to do both at the same time, I need to manually split the audio file into pieces (based on pauses/silence) and then send each piece to the individual services. Is there a better way to do this? Should I switch to another ecosystem, such as AWS Lex/Polly or Google's offerings?


Solution

  • Try the IBM Watson Speech to Text API. It has a feature called speaker diarization, which labels which speaker said each word of the transcript, so you do not have to split the audio yourself.

    More details here: https://www.ibm.com/blogs/watson/2016/12/look-whos-talking-ibm-debuts-watson-speech-text-speaker-diarization-beta/
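
To turn a diarized result into per-speaker text, the word timestamps have to be joined with the speaker labels. The following is a minimal sketch, assuming the recognition request was made with speaker labels enabled and that the JSON response contains a `results` list (word `timestamps` of the form `[word, start, end]`) plus a `speaker_labels` list with `from`, `to`, and `speaker` fields. The function name `group_words_by_speaker` and the sample response are made up for illustration; check the field names against the response you actually receive.

```python
from typing import List, Tuple


def group_words_by_speaker(response: dict) -> List[Tuple[int, str]]:
    """Merge word timestamps and speaker labels into (speaker, text) turns."""
    # Map each labeled word's start time to its speaker id.
    speaker_at = {
        label["from"]: label["speaker"]
        for label in response.get("speaker_labels", [])
    }

    turns: List[Tuple[int, List[str]]] = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # Use the first (best) alternative of each result.
        for word, start, _end in alternatives[0].get("timestamps", []):
            speaker = speaker_at.get(start)
            if speaker is None:
                continue  # no speaker label for this word
            if turns and turns[-1][0] == speaker:
                turns[-1][1].append(word)  # same speaker keeps talking
            else:
                turns.append((speaker, [word]))  # a new turn starts

    return [(speaker, " ".join(words)) for speaker, words in turns]


# Hypothetical response, shaped like the assumed format above.
sample_response = {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "hello there hi",
                    "timestamps": [
                        ["hello", 0.0, 0.4],
                        ["there", 0.4, 0.8],
                        ["hi", 1.2, 1.5],
                    ],
                }
            ]
        }
    ],
    "speaker_labels": [
        {"from": 0.0, "to": 0.4, "speaker": 0},
        {"from": 0.4, "to": 0.8, "speaker": 0},
        {"from": 1.2, "to": 1.5, "speaker": 1},
    ],
}

if __name__ == "__main__":
    for speaker, text in group_words_by_speaker(sample_response):
        print(f"Speaker {speaker}: {text}")
```

Matching on the start time keeps the sketch short; if the timestamps in your responses do not line up exactly, matching each word to the label whose `from`/`to` interval contains it would be more robust.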