I want to convert a book to audio and save the file, so naturally I don't want my computer to speak the book out loud while the conversion happens. Looking at the Azure documentation, I frankly don't see a way to get a stream object without speaking the text first. I've already got the code set up so that I can save the file, but I can't save the file unless I play the audio first. I want to convert some text to a stream object without having to listen to my computer utter the text. I realize a very inelegant solution is simply to mute my computer, but suppose the conversion takes an hour and I need to take a phone call on it.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription=subscription_key,
                                       region=service_region)
# Sends the synthesized audio to the default speaker
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
# Per the docstring further down, audio_config=None means the output audio is dropped instead of played
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
I don't want to do the following step, because it utters the audio:
result = speech_synthesizer.speak_text_async("I'm excited to try text to speech").get()
But I have to do that step in order to run the steps that follow:
stream = speechsdk.AudioDataStream(result)
stream.save_to_wav_file(path)
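From the docstring pasted further down, audio_config=None means the output audio is dropped rather than played, so my guess is that something like the following would synthesize silently and still let me save the result; I haven't been able to confirm from the docs whether the result keeps the audio in that case:

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=service_region)
speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'

# No audio output device at all; per the docstring the output audio is "dropped"
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

result = speech_synthesizer.speak_text_async("I'm excited to try text to speech").get()

# Hoping the result still carries the synthesized audio so it can be written out
stream = speechsdk.AudioDataStream(result)
stream.save_to_wav_file(path)  # path defined elsewhere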
I've tried looking at all the methods listed on the speech_synthesizer object, but all of them involve speaking the text. They are listed here (a sketch based on the audio_config parameter follows the listing):
class SpeechSynthesizer(builtins.object)
| SpeechSynthesizer(speech_config: azure.cognitiveservices.speech.SpeechConfig, audio_config: Optional[azure.cognitiveservices.speech.audio.AudioOutputConfig] = <azure.cognitiveservices.speech.audio.AudioOutputConfig object at 0x137ffc790>, auto_detect_source_language_config: azure.cognitiveservices.speech.languageconfig.AutoDetectSourceLanguageConfig = None)
|
| A speech synthesizer.
|
| :param speech_config: The config for the speech synthesizer
| :param audio_config: The config for the audio output.
| This parameter is optional.
| If it is not provided, the default speaker device will be used for audio output.
| If it is None, the output audio will be dropped.
| None can be used for scenarios like performance test.
| :param auto_detect_source_language_config: The auto detection source language config
|
| Methods defined here:
|
| __del__(self)
|
| __init__(self, speech_config: azure.cognitiveservices.speech.SpeechConfig, audio_config: Optional[azure.cognitiveservices.speech.audio.AudioOutputConfig] = <azure.cognitiveservices.speech.audio.AudioOutputConfig object at 0x137ffc790>, auto_detect_source_language_config: azure.cognitiveservices.speech.languageconfig.AutoDetectSourceLanguageConfig = None)
| Initialize self. See help(type(self)) for accurate signature.
|
| get_voices_async(self, locale: str = '') -> azure.cognitiveservices.speech.ResultFuture
| Get the available voices, asynchronously.
|
| :param locale: Specify the locale of voices, in BCP-47 format; or leave it empty to get all available voices.
| :returns: A task representing the asynchronous operation that gets the voices.
|
| speak_ssml(self, ssml: str) -> azure.cognitiveservices.speech.SpeechSynthesisResult
| Performs synthesis on ssml in a blocking (synchronous) mode.
|
| :returns: A SpeechSynthesisResult.
|
| speak_ssml_async(self, ssml: str) -> azure.cognitiveservices.speech.ResultFuture
| Performs synthesis on ssml in a non-blocking (asynchronous) mode.
|
| :returns: A future with SpeechSynthesisResult.
|
| speak_text(self, text: str) -> azure.cognitiveservices.speech.SpeechSynthesisResult
| Performs synthesis on plain text in a blocking (synchronous) mode.
|
| :returns: A SpeechSynthesisResult.
|
| speak_text_async(self, text: str) -> azure.cognitiveservices.speech.ResultFuture
| Performs synthesis on plain text in a non-blocking (asynchronous) mode.
|
| :returns: A future with SpeechSynthesisResult.
|
| start_speaking_ssml(self, ssml: str) -> azure.cognitiveservices.speech.SpeechSynthesisResult
| Starts synthesis on ssml in a blocking (synchronous) mode.
|
| :returns: A SpeechSynthesisResult.
|
| start_speaking_ssml_async(self, ssml: str) -> azure.cognitiveservices.speech.ResultFuture
| Starts synthesis on ssml in a non-blocking (asynchronous) mode.
|
| :returns: A future with SpeechSynthesisResult.
|
| start_speaking_text(self, text: str) -> azure.cognitiveservices.speech.SpeechSynthesisResult
| Starts synthesis on plain text in a blocking (synchronous) mode.
|
| :returns: A SpeechSynthesisResult.
|
| start_speaking_text_async(self, text: str) -> azure.cognitiveservices.speech.ResultFuture
| Starts synthesis on plain text in a non-blocking (asynchronous) mode.
|
| :returns: A future with SpeechSynthesisResult.
|
| stop_speaking(self) -> None
| Synchronously terminates ongoing synthesis operation.
| This method will stop playback and clear unread data in PullAudioOutputStream.
|
| stop_speaking_async(self) -> azure.cognitiveservices.speech.ResultFuture
| Asynchronously terminates ongoing synthesis operation.
| This method will stop playback and clear unread data in PullAudioOutputStream.
|
| :returns: A future that is fulfilled once synthesis has been stopped.
|
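The audio_config parameter takes an AudioOutputConfig, and from what I can tell from the SDK an AudioOutputConfig can also wrap a stream instead of a speaker. If that's right, an (untested) sketch like this would route the audio into a PullAudioOutputStream that I can read from, with nothing being played:

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=service_region)
speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'

# Route the output into an in-memory pull stream instead of a playback device
pull_stream = speechsdk.audio.PullAudioOutputStream()
stream_config = speechsdk.audio.AudioOutputConfig(stream=pull_stream)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=stream_config)

result = speech_synthesizer.speak_text_async("I'm excited to try text to speech").get()

# Destroy the synthesizer to close the output stream before reading from it
del speech_synthesizer

# Read the raw audio out of the stream in chunks
audio_buffer = bytes(32000)
filled_size = pull_stream.read(audio_buffer)
while filled_size > 0:
    # ... append audio_buffer[:filled_size] to a file or buffer here ...
    filled_size = pull_stream.read(audio_buffer)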
Someone recommended using the synthesize_speech_to_stream_async method, but his code resulted in errors and I haven't heard back from him. I think he might be on to something, though. His code was:
speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=service_region)
speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'
stream = speechsdk.AudioDataStream(format=speechsdk.AudioStreamFormat(pcm_data_format=speechsdk.PcmDataFormat.Pcm16Bit, sample_rate_hertz=16000, channel_count=1))
result = speechsdk.SpeechSynthesizer(speech_config=speech_config).synthesize_speech_to_stream_async("I'm excited to try text to speech", stream).get()
stream.save_to_wav_file(path)
This was the part that generated the error:
stream = speechsdk.AudioDataStream(
format=speechsdk.AudioStreamFormat(
pcm_data_format=speechsdk.PcmDataFormat.Pcm16Bit,
sample_rate_hertz=16000, channel_count=1))
and the error message recommended:
stream = speechsdk.AudioDataStream(
format=speechsdk.AudioStreamWaveFormat(
pcm_data_format=speechsdk.PcmDataFormat.Pcm16Bit,
sample_rate_hertz=16000, channel_count=1))
But that generated:
AttributeError: module 'azure.cognitiveservices.speech' has no attribute 'PcmDataFormat'
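As far as I can tell, the version of the SDK I have installed has no PcmDataFormat at all; the PCM format seems to be chosen on the SpeechConfig via set_speech_synthesis_output_format rather than on a stream object. This is only my guess at what the recommendation was reaching for, not something I've gotten working end to end:

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=service_region)
speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'

# 16 kHz, 16-bit, mono PCM in a RIFF (.wav) container, selected on the config
# itself rather than through a PcmDataFormat/AudioStreamFormat object
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm)

speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

With the synthesizer built this way, the synthesis and saving would then proceed as in the earlier snippets.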
I tried the following code to save stream audio to a .wav file with Azure Text to Speech, without speaking the text, using Python.
Code:
import azure.cognitiveservices.speech as speechsdk
import tempfile

subscription_key = '<speech_key>'
service_region = '<speech_region>'

speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=service_region)
speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'

# Send the synthesized audio to a temporary .wav file instead of the default
# speaker, so nothing is played aloud during synthesis
temp_file_path = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
audio_config = speechsdk.audio.AudioOutputConfig(filename=temp_file_path)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

text_to_speak = "Hi Kamali! I am happy to see you."
result = speech_synthesizer.speak_text_async(text_to_speak).get()

# The result also holds the audio bytes in memory, so they can be written wherever needed
file_path = 'output.wav'
with open(file_path, 'wb') as audio_file:
    audio_file.write(result.audio_data)
print(f"Audio saved to {file_path}")
Output:
The program ran successfully, converting the text to speech and saving it as a .wav file without any spoken output.
C:\Users\xxxxx\Documents\xxxxx>python sample.py
Audio saved to output.wav