I am synthesising speech from text using Azure Speech Service's TTS. When setting the audio config, I want to disable playback of the audio. Per the documentation, AudioOutputConfig's use_default_speaker keyword is False by default. Hence, the following code should work:
import os
import azure.cognitiveservices.speech as speechsdk
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ.get('SPEECH_KEY'),
    region=os.environ.get('SPEECH_REGION')
)
audio_config = speechsdk.audio.AudioOutputConfig()
but I get the following error:
ValueError: default speaker needs to be explicitly activated
The same goes if I explicitly set use_default_speaker=False.
The only way I can get the code to run is if I set use_default_speaker=True, but this way the audio is played through the computer's speakers, which is annoying and time-consuming when generating multiple samples.
I also tried experimenting with the stream keyword, but I can't figure out what to set it to.
I don't want to write the data to a WAV file using the filename keyword.
Does anyone know how I can turn off the behaviour of playing back the audio?
I found a solution by trial and error, trying different options from the Azure documentation, which wasn't particularly helpful. It turns out you can use PullAudioOutputStream() as your audio config:
import os
import azure.cognitiveservices.speech as speechsdk
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ.get('SPEECH_KEY'),
    region=os.environ.get('SPEECH_REGION')
)
audio_config = speechsdk.audio.PullAudioOutputStream()  # Change here
speech_synthesiser = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=audio_config
)
xml_str = """<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" version="1.0" xml:lang="sv-SE"><voice name="sv-SE-SofieNeural">Hej</voice></speak>"""
speech_synthesis_result = speech_synthesiser.speak_ssml(xml_str)
# Named audio_bytes rather than bytearray to avoid shadowing the built-in type.
audio_bytes = speech_synthesis_result.audio_data[44:]  # removing the 44-byte RIFF header
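Another option, described in Microsoft's Speech SDK documentation for in-memory results, is to pass audio_config=None explicitly to SpeechSynthesizer. Omitting the argument falls back to the default speaker, while an explicit None should suppress playback and leave the audio in result.audio_data. The sketch below is untested against a live service; the helper name synthesise_without_playback is hypothetical, and it assumes the same SPEECH_KEY and SPEECH_REGION environment variables as above.

```python
import os


def synthesise_without_playback(ssml: str) -> bytes:
    """Return synthesised audio as bytes without playing it (hypothetical helper)."""
    # Imported lazily so this module can be loaded without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ.get('SPEECH_KEY'),
        region=os.environ.get('SPEECH_REGION'),
    )
    # Passing None explicitly (rather than omitting audio_config) is what is
    # expected to disable output to the default speaker.
    synthesiser = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=None
    )
    result = synthesiser.speak_ssml(ssml)
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"Synthesis did not complete: {result.reason}")
    # audio_data holds the full audio, including any container header.
    return result.audio_data
```

Calling synthesise_without_playback(xml_str) would then return the same kind of bytes that the PullAudioOutputStream approach exposes through audio_data.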
A heads up: you may want to remove the RIFF header if you want to stitch together multiple audio bytearrays without introducing click noises.
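To illustrate the stitching, here is a minimal sketch that strips the header from each clip, joins the raw PCM, and writes a single new header with Python's standard wave module. The function name stitch_clips is hypothetical, the 44-byte header length is taken from the snippet above, and the 16 kHz, 16-bit, mono defaults are assumptions that must match whatever output format your speech config actually uses.

```python
import io
import wave

RIFF_HEADER_SIZE = 44  # header length stripped in the answer; assumed fixed


def stitch_clips(clips: list[bytes], sample_rate: int = 16000,
                 sample_width: int = 2, channels: int = 1) -> bytes:
    """Join several WAV clips into one WAV byte string (hypothetical helper)."""
    # Drop each clip's header so no header bytes end up inside the audio,
    # which is what would otherwise be heard as a click.
    pcm = b"".join(clip[RIFF_HEADER_SIZE:] for clip in clips)

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_out:
        # These parameters are assumptions; they must match the source clips.
        wav_out.setnchannels(channels)
        wav_out.setsampwidth(sample_width)
        wav_out.setframerate(sample_rate)
        wav_out.writeframes(pcm)
    return buffer.getvalue()
```

The result is a single in-memory WAV; nothing is written to disk unless you choose to save the returned bytes yourself.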