azure azure-cognitive-services azure-speech azure-text-translation

Process audio from a byte stream or file without saving to disk (Azure Speech SDK, Python)

I have a Flask app that receives audio files posted as form data. I want to process these audio files with the Azure Speech SDK to extract the text from the speech.

To improve performance, I would like to process the audio files without writing them to the server's disk.

However, the Azure Speech SDK only seems to work properly when it is given a filename. I am not able to pass the file as an AudioInputStream.

Could someone please help me process the file without saving it to disk?

def process_audio_files():
    file = request.files['audio-file']
    stream = file.stream

    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

    stream = speechsdk.audio.PushAudioInputStream(stream_format=stream)

    # How to pass the file in the AudioConfig as parameter without saving to the disk?

    audio_config = speechsdk.audio.AudioConfig(stream=stream)

    auto_detect_source_language_config = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(languages=["en-US", "fr-FR", "es-ES"])

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, auto_detect_source_language_config=auto_detect_source_language_config, audio_config=audio_config)

Solution

I used the following Flask app to convert speech to text with the Azure Speech SDK in Python, without saving the uploaded file to disk.

Code :

app.py :

from flask import Flask, render_template, request
import azure.cognitiveservices.speech as speechsdk

app = Flask(__name__)

speech_key = '<speech_key>'
service_region = '<speech_region>'

@app.route('/', methods=['GET', 'POST'])
def index():
    if request.method == 'POST':
        try:
            # Take the uploaded file from the request instead of saving it under a filename.
            file = request.files['audio-file']
            stream = file.stream
            stream.seek(0)

            speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
            # The default AudioStreamFormat describes 16 kHz, 16-bit, mono PCM audio.
            audio_stream = speechsdk.audio.PushAudioInputStream(stream_format=speechsdk.audio.AudioStreamFormat())

            # Push the audio bytes into the SDK stream, then close it to signal the end of the audio.
            audio_stream.write(stream.read())
            audio_stream.close()

            audio_config = speechsdk.audio.AudioConfig(stream=audio_stream)
            auto_detect_source_language_config = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
                languages=["en-US", "fr-FR", "es-ES"])

            speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                                           auto_detect_source_language_config=auto_detect_source_language_config,
                                                           audio_config=audio_config)

            # recognize_once() returns after the first recognized utterance.
            result = speech_recognizer.recognize_once()

            return render_template('result.html', text=result.text)

        except Exception as e:
            # Report the error text with a 500 status instead of a 200.
            return str(e), 500

    return render_template('index.html')

if __name__ == '__main__':
    app.run(debug=True)
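
One caveat with app.py: the default `AudioStreamFormat()` describes raw 16 kHz, 16-bit, mono PCM, so writing a whole .wav upload into the push stream also feeds the file header in as audio, and a file recorded at another sample rate would be misinterpreted. Below is a minimal sketch of one way to split the upload into its format parameters and raw PCM frames first, using only the standard library `wave` module. The helper name `split_wav` is hypothetical (it is not part of the SDK), and the sketch assumes the upload is an uncompressed PCM .wav file.

```python
import io
import wave


def split_wav(wav_bytes):
    """Return (samples_per_second, bits_per_sample, channels, pcm_frames)
    for an uncompressed PCM .wav held entirely in memory.

    Hypothetical helper for illustration; not part of the Azure Speech SDK.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())  # raw PCM, header excluded
    return rate, bits, channels, frames


if __name__ == "__main__":
    # Build a tiny 16 kHz, 16-bit, mono WAV in memory to exercise the helper.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 160)
    print(split_wav(buf.getvalue())[:3])
```

In the Flask handler, the returned values could then be passed as `speechsdk.audio.AudioStreamFormat(samples_per_second=rate, bits_per_sample=bits, channels=channels)`, with only `frames` written to the push stream.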

templates/index.html :

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Speech to Text</title>
</head>
<body>
    <h1>Upload Audio File</h1>
    <form action="/" method="post" enctype="multipart/form-data">
        <input type="file" name="audio-file" accept="audio/*" required>
        <input type="submit" value="Transcribe">
    </form>
</body>
</html>

templates/result.html :

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Transcription Result</title>
</head>
<body>
    <h1>Transcription Result</h1>
    <p>{{ text }}</p>
</body>
</html>

Output :

The Flask app ran successfully.

I opened the upload page in the browser and chose a .wav audio file to convert from speech to text.

The speech was successfully converted to text without saving the file to disk.