python-3.x google-api google-speech-api

Can the Google Speech API convert text to speech?


I used the Google Speech API to successfully convert speech to text with the following code.

import speech_recognition as sr

# Obtain audio from the microphone
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say something!")
    audio = r.listen(source)

# Recognize speech using Google Cloud Speech.
# Insert the contents of the Google Cloud Speech JSON credentials file here.
GOOGLE_CLOUD_SPEECH_CREDENTIALS = r"""{KEY}
"""
try:
    speechOutput = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS, language="si-LK")
except sr.UnknownValueError:
    # The service answered but could not make sense of the audio
    speechOutput = "Google Cloud Speech could not understand audio"
except sr.RequestError as e:
    # The request itself failed (network, credentials, quota, ...)
    speechOutput = "Could not request results from Google Cloud Speech service; {0}".format(e)
print(speechOutput)

I want to know whether I can convert text to speech using the same API. If not, which API should I use, and is there sample Python code for it? Thank you!


Solution

  • For this you'll need the separate Text-to-Speech API, which is in Beta as of this writing. You can find a Python quickstart in the Client Library section of the docs. The sample is part of the python-docs-samples repo. The relevant part of the example is reproduced here for better visibility:

    def synthesize_text(text):
        """Synthesizes speech from the input string of text."""
        from google.cloud import texttospeech
        client = texttospeech.TextToSpeechClient()
    
        input_text = texttospeech.types.SynthesisInput(text=text)
    
        # Note: the voice can also be specified by name.
        # Names of voices can be retrieved with client.list_voices().
        voice = texttospeech.types.VoiceSelectionParams(
            language_code='en-US',
            ssml_gender=texttospeech.enums.SsmlVoiceGender.FEMALE)
    
        audio_config = texttospeech.types.AudioConfig(
            audio_encoding=texttospeech.enums.AudioEncoding.MP3)
    
        response = client.synthesize_speech(input_text, voice, audio_config)
    
        # The response's audio_content is binary.
        with open('output.mp3', 'wb') as out:
            out.write(response.audio_content)
            print('Audio content written to file "output.mp3"')
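
The sample above was written against the Beta-era client library. From google-cloud-texttospeech 2.0 onward, the `types` and `enums` submodules no longer exist and `synthesize_speech` takes keyword arguments, so the same steps look roughly as follows. This is a sketch, not the official sample: it assumes a 2.0+ install and working application credentials, and the `out_path` parameter is a hypothetical addition for illustration.

```python
def synthesize_text(text, out_path="output.mp3"):
    """Synthesize `text` to an MP3 file (sketch for client library 2.0+)."""
    # Imported lazily so this module still loads where the library is missing.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    # In 2.0+ the message classes and enums hang directly off the package.
    synthesis_input = texttospeech.SynthesisInput(text=text)
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
    )

    # 2.0+ expects keyword arguments (or a single request object).
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )

    # audio_content holds the raw MP3 bytes; out_path is a hypothetical parameter.
    with open(out_path, "wb") as out:
        out.write(response.audio_content)
    return out_path
```

Calling `synthesize_text("Hello there")` would then write `output.mp3` to the current directory, provided the credentials are set up.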
    

    Update: rate and pitch configuration

    You can enclose the text in a <prosody> tag to modify the rate and pitch. For example:

    <prosody rate="slow" pitch="-2st">Can you hear me now?</prosody>
    

    The possible values for those attributes follow the W3C SSML specification. The SSML docs for the Text-to-Speech API cover this in detail and also provide some samples.
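
To show how such markup might be assembled from Python, here is a minimal sketch that wraps plain text in a `<prosody>` element. The helper name `prosody_ssml` and its defaults are hypothetical. It assumes the service wants the markup inside a `<speak>` root element, and it escapes the text so characters like `&` do not break the XML.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody_ssml(text, rate="medium", pitch="+0st"):
    """Return an SSML document applying `rate` and `pitch` to `text`.

    `rate` and `pitch` are passed through unchecked; valid values are
    whatever the SSML specification allows (e.g. "slow", "-2st").
    """
    return "<speak><prosody rate={} pitch={}>{}</prosody></speak>".format(
        quoteattr(rate),   # adds the surrounding quotes and escapes the value
        quoteattr(pitch),
        escape(text),      # escapes &, < and > in the spoken text
    )


ssml = prosody_ssml("Can you hear me now?", rate="slow", pitch="-2st")
print(ssml)
```

The resulting string is passed to the API as the `ssml` field of the synthesis input, in place of `text`.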

    You can also control the overall audio playback rate with the speed attribute of <audio>, which currently accepts values from 50% to 200% (in 1% increments).
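
The range constraint just described can be checked before the markup is sent. The sketch below builds an `<audio>` element with a `speed` attribute and rejects values outside 50 to 200. The helper name `audio_ssml` is hypothetical and the URL in the usage line is a placeholder. The limits are taken from the description above, so verify them against the current SSML docs.

```python
from xml.sax.saxutils import quoteattr


def audio_ssml(src, speed_percent=100):
    """Return SSML that plays the clip at `src` at `speed_percent` of normal rate."""
    # Whole-number percentages between 50 and 200, per the description above.
    if not isinstance(speed_percent, int) or not 50 <= speed_percent <= 200:
        raise ValueError("speed_percent must be an integer from 50 to 200")
    return '<speak><audio src={} speed="{}%"/></speak>'.format(
        quoteattr(src),  # quotes and escapes the URL for use as an attribute
        speed_percent,
    )


# Placeholder URL; substitute a clip the service can actually fetch.
print(audio_ssml("https://example.com/clip.mp3", 150))
```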