azure text-to-speech azure-cognitive-services

Azure "Text-To-Speech" returns "Invalid CID or language". What does it mean?

I'm trying to post to the Azure text-to-speech service. I have already acquired the access token, and now I'm trying to make a call that converts text to speech (using Best HTTP in Unity):

    HTTPRequest request = new HTTPRequest(new Uri(APIEndpointURL), HTTPMethods.Post, _GotTextToSpeechResponse);

    request.AddHeader("Authorization", "Bearer " + accessToken);
    request.AddHeader("Content-Type", "application/ssml+xml");
    request.AddHeader("X-Microsoft-OutputFormat", "raw-16khz-16bit-mono-pcm");
    request.AddHeader("User-Agent", "My app name");

    request.RawData = Encoding.UTF8.GetBytes("Hello world!");
    request.Send();

This returns a status code 400 with the following:

{"Message":"Invalid CID or language"}

The documentation says that if I don't define a language and just send text, the default voice should be used. Then there's the "User-Agent" header, which should be the "Application name". The documentation doesn't say whether this has to be predefined somewhere or what it refers to.

What does the error mean, and how do I fix it? Am I doing something wrong by posting the text as "raw data"? The documentation says I should post the text in the body of the request.

Solution

A few things are not clear in the documentation, but they become apparent if you take a detailed look at the sample it provides.

Endpoint

You want to use text-to-speech (generate the voice for your "Hello world!" text), but you are calling an stt (speech-to-text) endpoint, which is meant for speech recognition:

https://westeurope.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1

To use tts, the endpoint should follow the same format as the one in the sample:

https://westeurope.tts.speech.microsoft.com/cognitiveservices/v1
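Since the only differences from the stt URL are the "tts" host part and the path, a small helper can keep the two from being mixed up. This is just a sketch: the class and method names are made up for illustration, and it assumes every region follows the same URL pattern as the "westeurope" example above.

```csharp
using System;

public static class SpeechEndpoints
{
    // Builds the text-to-speech URL for a region such as "westeurope".
    // Assumption: all regions use the same host and path pattern as the sample URL.
    public static Uri TextToSpeech(string region)
    {
        if (string.IsNullOrWhiteSpace(region))
            throw new ArgumentException("Region must not be empty.", nameof(region));

        string host = region.Trim().ToLowerInvariant() + ".tts.speech.microsoft.com";
        return new Uri("https://" + host + "/cognitiveservices/v1");
    }
}
```

With Best HTTP, the result would be passed in place of "new Uri(APIEndpointURL)" in the request constructor.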

Content of the request

As for sending plain text instead of SSML, the documentation states:

Text is sent as the body of an HTTP POST request. It can be plain text (ASCII or UTF-8) or Speech Synthesis Markup Language (SSML) format (UTF-8). Plain text requests use the Speech Service's default voice and language. With SSML you can specify the voice and language.

So I tried changing the Content-Type from "application/ssml+xml" to "text/plain", but in that case I got:

Error 400 Data at the root level is invalid. Line 1, position 1.

This is a common error when parsing XML, which suggests the service still tries to read the body as XML. So there seems to be a bug somewhere, and I can't find a sample in the documentation that uses TTS without SSML.

Someone has already asked about the plain-text problem in the Feedback section of the documentation page (under Next Steps).
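Until plain-text requests work as documented, one option is to keep the "application/ssml+xml" content type and wrap the text in a minimal SSML document, which is what the documentation sample sends. The sketch below is not taken from the documentation: the class and method names are hypothetical, and the voice name in the usage comment is only an example that you should replace with one available in your region.

```csharp
using System.Xml.Linq;

public static class SsmlBuilder
{
    // Wraps plain text in a minimal <speak><voice>...</voice></speak> document.
    // XElement escapes characters such as '<' and '&' in the text, so the
    // resulting body is always well-formed XML.
    public static string Wrap(string text, string language, string voiceName)
    {
        var speak = new XElement("speak",
            new XAttribute("version", "1.0"),
            new XAttribute(XNamespace.Xml + "lang", language),
            new XElement("voice",
                new XAttribute(XNamespace.Xml + "lang", language),
                new XAttribute("name", voiceName),
                text));

        // DisableFormatting keeps the document on one line without added whitespace.
        return speak.ToString(SaveOptions.DisableFormatting);
    }
}

// Usage with the request from the question (voice name is an assumption):
// request.RawData = Encoding.UTF8.GetBytes(
//     SsmlBuilder.Wrap("Hello world!", "en-US", "en-US-JennyNeural"));
```

Building the XML with XElement rather than string concatenation means user-supplied text cannot break the document and trigger another 400 response from the parser.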