I'm using the Microsoft Cognitive Services speech-to-text Python API for transcription.
Right now I'm capturing the sound through a Web API (using the microphone part here: https://ricardodeazambuja.com/deep_learning/2019/03/09/audio_and_video_google_colab/), writing it to 'sound.wav', and then sending 'sound.wav' to the MCS STT engine to get the transcription. The Web API gives me a numpy array together with the sample rate of the sound.
My question is: is it possible to send the numpy array and the sample rate directly to MCS STT instead of writing a WAV file?
Here is my code:
import azure.cognitiveservices.speech as speechsdk
import scipy.io.wavfile

# browser-side recording: numpy array plus sample rate
audio, sr = get_audio()

# write the audio to disk...
p = 'sound.wav'
scipy.io.wavfile.write(p, sr, audio)

# ...and point the recognizer at the file
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
audio_input = speechsdk.AudioConfig(filename=p)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
Based on my research and a look through the code:
You will not be able to use the microphone directly in Google Colab, because you have no access to the machine the Python code actually executes on. That is why you made use of the article, which records the audio at the web-browser level.
Now, the recorded audio is in the WEBM format. As per the article's code, FFMPEG is then used to convert it to the WAV format.
Please note, however, that this output contains the WAV header in addition to the raw audio data.
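That conversion happens in memory, along these lines (a hedged sketch; webm_bytes is an illustrative name for the recorded data, and the exact ffmpeg invocation in the article may differ):

import subprocess

# pipe the browser's WEBM bytes through ffmpeg and capture the
# WAV output (RIFF header + PCM audio data) on stdout
proc = subprocess.run(
    ['ffmpeg', '-i', 'pipe:0', '-f', 'wav', 'pipe:1'],
    input=webm_bytes, stdout=subprocess.PIPE, check=True)
riff = proc.stdout  # these WAV bytes are the "riff" referred to below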
This is not what is returned in your snippet, though: instead of returning audio, sr, your get_audio() will have to return the riff, i.e. the WAV audio as bytes (which includes the header in addition to the audio data).
I came across a post that explains the composition of a WAV file at the byte level (you can map it directly onto that output):
http://soundfile.sapp.org/doc/WaveFormat/
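Rather than slicing the header apart by hand, Python's built-in wave module can do the parsing; a minimal sketch, assuming riff holds the WAV bytes returned by get_audio():

import io
import wave

# pull the format fields and the header-free PCM frames out of the RIFF container
with wave.open(io.BytesIO(riff), 'rb') as wf:
    channels = wf.getnchannels()
    bitsPerSample = wf.getsampwidth() * 8
    samplesPerSecond = wf.getframerate()
    audiodata = wf.readframes(wf.getnframes())  # raw PCM bytes, no header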
With the audio data bytes, the samples per second, and the other necessary fields extracted, use the PushAudioInputStream method:
channels = 1
bitsPerSample = 16
samplesPerSecond = 16000

# the stream format (16 kHz, 16-bit, mono here) must match the PCM data you extracted
audioFormat = speechsdk.audio.AudioStreamFormat(samplesPerSecond, bitsPerSample, channels)
custom_push_stream = speechsdk.audio.PushAudioInputStream(stream_format=audioFormat)
You can then write the audio data to this custom_push_stream to run the STT:
custom_push_stream.write(audiodata)
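Putting it together, here is a minimal end-to-end sketch (assuming get_audio() has been modified to return the riff bytes as above, and that speech_key and service_region are defined as in your code); the recognizer is wired to the push stream through an AudioConfig:

import io
import wave

import azure.cognitiveservices.speech as speechsdk

riff = get_audio()  # assumed: modified to return the WAV bytes

# extract the format fields and the header-free PCM data from the WAV bytes
with wave.open(io.BytesIO(riff), 'rb') as wf:
    audioFormat = speechsdk.audio.AudioStreamFormat(
        samples_per_second=wf.getframerate(),
        bits_per_sample=wf.getsampwidth() * 8,
        channels=wf.getnchannels())
    audiodata = wf.readframes(wf.getnframes())

# feed the bytes through a push stream instead of a file on disk
custom_push_stream = speechsdk.audio.PushAudioInputStream(stream_format=audioFormat)
custom_push_stream.write(audiodata)
custom_push_stream.close()  # signal end of stream so recognition can complete

speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
audio_input = speechsdk.audio.AudioConfig(stream=custom_push_stream)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)

result = speech_recognizer.recognize_once()
print(result.text)

And if you would rather stay on the numpy route, audio.astype('int16').tobytes() gives you the same header-free PCM bytes, provided the array really is 16-bit mono at the sample rate you declare in the AudioStreamFormat.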