speech-recognition ibm-watson azure-cognitive-services google-speech-api dialogflow-es

How to identify multiple speakers and their text from an audio input?

I am using Microsoft's cognitive services. I have an audio input and need to identify multiple speakers and their individual text.

As I understand it, the Speaker Recognition API can identify different individuals and the Bing Speech API can convert speech to text. However, to do both at once, I would need to manually split the audio file into pieces (based on pauses/silence) and then send each piece to the individual services. Is there a better way to do this? Should I switch to another ecosystem, such as AWS Lex/Polly or Google's offerings?

Solution

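You should try the IBM Watson Speech to Text API. It has a feature called Speaker Diarization, which labels the speaker of each recognized word, so a single request gives you both the transcript and the speaker attribution without splitting the audio yourself.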

More details here: https://www.ibm.com/blogs/watson/2016/12/look-whos-talking-ibm-debuts-watson-speech-text-speaker-diarization-beta/
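As a rough illustration, here is a minimal sketch of turning a diarized response into per-speaker text. It assumes the request was made with speaker labels enabled (the `speaker_labels` parameter) and that the response contains `results[].alternatives[].timestamps` as `[word, start, end]` entries plus a top-level `speaker_labels` list with `from` and `speaker` fields. The helper name `group_by_speaker` and the `sample` data are made up for this example, and matching words to labels by exact start time is an assumption, not something the service guarantees.

```python
from typing import Dict, List, Tuple


def group_by_speaker(response: Dict) -> List[Tuple[int, str]]:
    """Merge word timestamps with speaker labels into (speaker, text) turns.

    Hypothetical helper: assumes each word's start time matches the
    "from" value of exactly one speaker label.
    """
    # Map each labelled start time to its speaker id.
    speaker_at = {
        label["from"]: label["speaker"]
        for label in response.get("speaker_labels", [])
    }

    turns: List[Tuple[int, List[str]]] = []
    for result in response.get("results", []):
        timestamps = result["alternatives"][0].get("timestamps", [])
        for word, start, _end in timestamps:
            speaker = speaker_at.get(start)
            if speaker is None:
                continue  # no label for this word; skip it
            if turns and turns[-1][0] == speaker:
                turns[-1][1].append(word)  # same speaker keeps talking
            else:
                turns.append((speaker, [word]))  # speaker change: new turn

    return [(speaker, " ".join(words)) for speaker, words in turns]


# Made-up sample shaped like a diarized response (not real service output).
sample = {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "hello there hi",
                    "timestamps": [
                        ["hello", 0.0, 0.4],
                        ["there", 0.4, 0.8],
                        ["hi", 1.2, 1.5],
                    ],
                }
            ]
        }
    ],
    "speaker_labels": [
        {"from": 0.0, "to": 0.4, "speaker": 0},
        {"from": 0.4, "to": 0.8, "speaker": 0},
        {"from": 1.2, "to": 1.5, "speaker": 1},
    ],
}

for speaker, text in group_by_speaker(sample):
    print(f"Speaker {speaker}: {text}")
```

With a real response you would pass the parsed JSON result in place of `sample`; check the field names against the current API reference before relying on them.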
