Tags: azure, unity-game-engine, text-to-speech, azure-cognitive-services

Broken sounds at the beginning and end of playback when using Microsoft Azure Text to Speech with Unity


I am using Microsoft Azure Text to Speech with Unity, but I hear broken sounds at the beginning and end of the synthesized audio when it plays. Is this normal, or is result.AudioData corrupted? My code is below.

    public AudioSource audioSource;
    void Start()
    {
        SynthesisToSpeaker("你好世界");
    }
    public void SynthesisToSpeaker(string text)
    {
        var config = SpeechConfig.FromSubscription("[redacted]", "southeastasia");
        config.SpeechSynthesisLanguage = "zh-CN";
        config.SpeechSynthesisVoiceName = "zh-CN-XiaoxiaoNeural";

        // Creates a speech synthesizer.
        // Make sure to dispose the synthesizer after use!       
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(config, null);
        Task<SpeechSynthesisResult> task = synthesizer.SpeakTextAsync(text);
        StartCoroutine(CheckSynthesizer(task, config, synthesizer));
    }
    private IEnumerator CheckSynthesizer(Task<SpeechSynthesisResult> task,
        SpeechConfig config,
        SpeechSynthesizer synthesizer)
    {
        yield return new WaitUntil(() => task.IsCompleted);
        var result = task.Result;
        // Checks result.
        string newMessage = string.Empty;
        if (result.Reason == ResultReason.SynthesizingAudioCompleted)
        {
            var sampleCount = result.AudioData.Length / 2;
            var audioData = new float[sampleCount];
            for (var i = 0; i < sampleCount; ++i)
            {
                audioData[i] = (short)(result.AudioData[i * 2 + 1] << 8
                        | result.AudioData[i * 2]) / 32768.0F;
            }
            // The default output audio format is 16K 16bit mono
            var audioClip = AudioClip.Create("SynthesizedAudio", sampleCount,
                    1, 16000, false);
            audioClip.SetData(audioData, 0);
            audioSource.clip = audioClip;
            audioSource.Play();

        }
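        else if (result.Reason == ResultReason.Canceled)
        {
            // Log why synthesis was canceled instead of silently discarding the details.
            var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
            Debug.LogError($"Synthesis canceled: {cancellation.Reason}, {cancellation.ErrorDetails}");
        }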
        synthesizer.Dispose();
    }


Solution

  • The default output format is Riff16Khz16BitMonoPcm, which puts a RIFF (WAV) header at the beginning of result.AudioData. If you convert the whole buffer to samples and pass it to the AudioClip, the header bytes are played as if they were audio, and that is the noise you hear.

    You can switch to a raw format without a header by calling speechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Raw16Khz16BitMonoPcm); before creating the synthesizer (speechConfig is the variable named config in the question's code). See this sample for details.
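
If you would rather keep the default RIFF format, an alternative (not part of the answer above) is to skip the header yourself before converting the bytes to samples. The following is only a minimal sketch under stated assumptions: PcmConversion, FindPcmDataOffset and ToFloatSamples are hypothetical names, not part of the Speech SDK or Unity, and the code assumes 16-bit little-endian mono PCM with nothing following the "data" chunk.

```csharp
using System;

// Hypothetical helper for illustration; not part of the Speech SDK or Unity.
public static class PcmConversion
{
    // Returns the index of the first PCM byte. If the buffer starts with a
    // RIFF/WAVE header, walks the chunk list up to the "data" chunk;
    // otherwise treats the buffer as raw PCM and returns 0.
    public static int FindPcmDataOffset(byte[] audio)
    {
        if (audio.Length < 12 || !HasTag(audio, 0, "RIFF") || !HasTag(audio, 8, "WAVE"))
            return 0;

        long pos = 12; // first chunk follows "RIFF", the file size, and "WAVE"
        while (pos + 8 <= audio.Length)
        {
            if (HasTag(audio, (int)pos, "data"))
                return (int)(pos + 8); // skip the chunk id and its 4-byte size

            // Chunk sizes are little-endian; chunks are padded to an even length.
            long size = audio[pos + 4] | audio[pos + 5] << 8
                      | audio[pos + 6] << 16 | (long)audio[pos + 7] << 24;
            pos += 8 + size + (size & 1);
        }
        throw new FormatException("RIFF header found, but no \"data\" chunk.");
    }

    // Converts 16-bit little-endian PCM starting at 'offset' to floats in [-1, 1).
    // Assumption: every byte from 'offset' to the end of the buffer is sample data.
    public static float[] ToFloatSamples(byte[] audio, int offset)
    {
        int sampleCount = (audio.Length - offset) / 2;
        var samples = new float[sampleCount];
        for (int i = 0; i < sampleCount; i++)
        {
            int lo = audio[offset + 2 * i];
            int hi = audio[offset + 2 * i + 1];
            samples[i] = (short)((hi << 8) | lo) / 32768f;
        }
        return samples;
    }

    private static bool HasTag(byte[] audio, int offset, string tag)
    {
        for (int i = 0; i < 4; i++)
            if (audio[offset + i] != (byte)tag[i]) return false;
        return true;
    }
}
```

In the question's coroutine, the conversion loop could then be replaced by calling PcmConversion.FindPcmDataOffset(result.AudioData), passing the returned offset to PcmConversion.ToFloatSamples, and using the length of the returned array as the sample count for AudioClip.Create. Because raw PCM yields an offset of 0, the same code would also work after switching to Raw16Khz16BitMonoPcm.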