How can I change how Alexa pronounces a specific word in a custom skill?

Sometimes, when developing an Alexa skill and programming the responses from my service, Alexa mispronounces one of the words in my reply, confusing the user.

For example, suppose I want Alexa to say the word live. It has two pronunciations (it rhymes with "give" as a verb and with "five" as an adjective), so how can I tell Alexa which one to use?

Is there a way to dictate to Alexa the correct pronunciation, or replace it with a custom sound that is correct? Do I need to use additional markup or an API call?

Solution

Alexa supports SSML, which is an XML-like markup language for speech. Instead of returning plain text from your service, you can use SSML responses. The <phoneme> tag is what you need in particular:

phoneme

Provides a phonemic/phonetic pronunciation for the contained text. For example, people may pronounce words like “pecan” differently.

For English words (especially US English), Alexa should be able to pronounce any word if you give it the correct phonetic pronunciation:

The following tables list the supported symbols for use with the phoneme tag. These symbols provide full coverage for the sounds of US English. Note that many non-English languages require the use of symbols not included in this list, which are not supported. Using symbols not included in this list is discouraged, as it may result in suboptimal speech synthesis.

(Quoted from Amazon's documentation on SSML.)

Here's an example of giving Alexa a specific pronunciation for your word live:

<speak>
    <phoneme alphabet="ipa" ph="lɪv">live 1</phoneme>.
    <phoneme alphabet="ipa" ph="laɪv">live 2</phoneme>.
</speak>

The <phoneme> tag supports the IPA and X-SAMPA phonetic alphabets. You can typically find IPA spellings for any word on Wiktionary or through Google.
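To make the "return SSML instead of plain text" step concrete, here is a minimal Python sketch of a response body carrying a phoneme tag. It assumes your service builds the raw JSON response itself rather than going through an SDK; the helper names `phoneme` and `build_ssml_response` are made up for illustration, and the sentence is only an example.

```python
import json
from xml.sax.saxutils import escape, quoteattr


def phoneme(word, ph, alphabet="ipa"):
    """Wrap a word in an SSML <phoneme> tag (hypothetical helper)."""
    # quoteattr adds the surrounding quotes and escapes the attribute value.
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(word)}</phoneme>"
    )


def build_ssml_response(ssml_body):
    """Build a skill response whose outputSpeech is SSML, not plain text."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {
                # "SSML" with an "ssml" field replaces "PlainText" with "text".
                "type": "SSML",
                "ssml": f"<speak>{ssml_body}</speak>",
            },
            "shouldEndSession": True,
        },
    }


# Force the adjective pronunciation of "live" (rhymes with "five").
body = "The show is " + phoneme("live", "laɪv") + " tonight."
response = build_ssml_response(body)
print(json.dumps(response, ensure_ascii=False, indent=2))
```

The key point is that the output speech type is SSML and the markup is wrapped in a single speak element; everything else in the response stays the same as for a plain-text reply.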

For longer messages, it may be easier to record the speech yourself and play the recording with the <audio> tag:

The audio tag lets you provide the URL for an MP3 file that the Alexa service can play while rendering a response. You can use this to embed short, pre-recorded audio within your service’s response. For example, you could include sound effects alongside your text-to-speech responses, or provide responses using a voice associated with your brand.

(Quoted from Amazon's documentation on <audio>.)
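As a rough sketch of the audio approach, the following Python helper builds an SSML string that plays a pre-recorded clip after an optional spoken lead-in. The function name `audio_ssml` and the example URL are placeholders, not anything from Amazon's documentation, and the HTTPS check reflects my understanding that Alexa only plays audio hosted on an HTTPS endpoint; check the current documentation for the exact file format and length limits.

```python
from xml.sax.saxutils import escape, quoteattr


def audio_ssml(mp3_url, lead_in=""):
    """Return SSML that speaks lead_in, then plays the MP3 at mp3_url.

    Hypothetical helper; the URL must point at a file you host yourself.
    """
    # Assumption: Alexa requires the audio file to be served over HTTPS.
    if not mp3_url.startswith("https://"):
        raise ValueError("audio URL must use HTTPS")
    return f"<speak>{escape(lead_in)}<audio src={quoteattr(mp3_url)}/></speak>"


# Placeholder URL for illustration only.
ssml = audio_ssml("https://example.com/greeting.mp3", "Here is a message. ")
print(ssml)
```

The resulting string goes into the same SSML output speech field as any other SSML response.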