Tags: c#, powershell, audio, text-to-speech

Microsoft SpeechSynthesizer crackles when outputting to files and streams


I'm writing a thing that uses the SpeechSynthesizer to generate wave files on request, but I'm having problems with crackling noises. The weird thing is that output directly to the sound card is just fine.

This short PowerShell script demonstrates the issue, though I'm writing my program in C#.

Add-Type -AssemblyName System.Speech
$speech = New-Object System.Speech.Synthesis.SpeechSynthesizer
$speech.Speak('Guybrush Threepwood, mighty pirate!')
$speech.SetOutputToWaveFile("${PSScriptRoot}\foo.wav")
$speech.Speak('Guybrush Threepwood, mighty pirate!')

What this should do is output to the speakers and then save that same sound as "foo.wav" next to the script.
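
Since the real program is in C#, a rough C# equivalent of the script might look like the following. It is only a sketch: it assumes a project reference to the System.Speech assembly, and the class name and output location are placeholders rather than anything from the actual program.

```csharp
using System;
using System.IO;
using System.Speech.Synthesis;

internal static class CrackleRepro // hypothetical name, for illustration only
{
    private static void Main()
    {
        using (var speech = new SpeechSynthesizer())
        {
            // First pass: speak through the default audio device.
            speech.SetOutputToDefaultAudioDevice();
            speech.Speak("Guybrush Threepwood, mighty pirate!");

            // Second pass: same text, written to foo.wav next to the executable.
            var path = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "foo.wav");
            speech.SetOutputToWaveFile(path);
            speech.Speak("Guybrush Threepwood, mighty pirate!");

            // Detach the synthesizer from the file so the handle is released.
            speech.SetOutputToNull();
        }
    }
}
```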

What it actually does is output to the speakers and then save a crackling version that sounds like an old record player. I've tested this on three different machines, and although they select different default voices (all Microsoft-provided), every wave file sounds like garbage falling down stairs.

Why?

EDIT: I am testing this on Windows 10 Pro, with the latest updates that add that annoying "People" button on the taskbar.

EDIT 2: Here's a link to an example sound generated with the above script. Notice the crackling in the voice, which is not there when the script outputs directly to the speakers.

EDIT 3: It's even more noticeable with a female voice.

EDIT 4: The same voice as above, saved to file with TextAloud 3: no crackling, no vertical spikes.


Solution

  • This is an issue with the SpeechSynthesizer API, which simply produces poor-quality, crackling audio, as the samples above show. The solution is to do what TextAloud does and use the SpeechLib COM objects directly.

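    To do that, add a COM reference to "Microsoft Speech Object Library (5.4)". Here is a snippet of the code I ended up with, which produces audio clips of the same quality as TextAloud. It also uses NAudio (WaveFormat, RawSourceWaveStream, WaveFileWriter) to wrap the raw samples from SAPI in a WAV container: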

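    // Namespaces used: System, System.IO, System.Runtime.InteropServices,
    // System.Threading, SpeechLib, NAudio.Wave (plus NAudio.Lame for ConvertToMp3).
    public new static byte[] GetSound(Order o)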
    {
        const SpeechVoiceSpeakFlags speechFlags = SpeechVoiceSpeakFlags.SVSFlagsAsync;
        var synth = new SpVoice();
        var wave = new SpMemoryStream();
        var voices = synth.GetVoices();
        try
        {
            // synth setup
            synth.Volume = Math.Max(1, Math.Min(100, o.Volume ?? 100));
            synth.Rate = Math.Max(-10, Math.Min(10, o.Rate ?? 0));
            foreach (SpObjectToken voice in voices)
            {
                if (voice.GetAttribute("Name") == o.Voice.Name)
                {
                    synth.Voice = voice;
                }
            }
            wave.Format.Type = SpeechAudioFormatType.SAFT22kHz16BitMono;
            synth.AudioOutputStream = wave;
            synth.Speak(o.Text, speechFlags);
            synth.WaitUntilDone(Timeout.Infinite);
    
            var waveFormat = new WaveFormat(22050, 16, 1);
            using (var ms = new MemoryStream((byte[])wave.GetData()))
            using (var reader = new RawSourceWaveStream(ms, waveFormat))
            using (var outStream = new MemoryStream())
            using (var writer = new WaveFileWriter(outStream, waveFormat))
            {
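                reader.CopyTo(writer);
                // Flush so the WAV header carries the final data length before the bytes are read.
                writer.Flush();
                // ToArray() returns only the bytes written; GetBuffer() can include unused capacity.
                return o.Mp3 ? ConvertToMp3(outStream) : outStream.ToArray();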
            }
        }
        finally
        {
            Marshal.ReleaseComObject(voices);
            Marshal.ReleaseComObject(wave);
            Marshal.ReleaseComObject(synth);
        }
    }
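
    The snippet takes an Order that is not shown. As a reading aid, here is a minimal sketch of what such a type could look like, inferred only from the members GetSound touches (Text, Voice.Name, Volume, Rate, Mp3). Both type names and all property types are assumptions; the real classes may differ.

    ```csharp
    // Hypothetical shapes, inferred from how GetSound uses its argument.
    public class VoiceChoice
    {
        // Compared against the "Name" attribute of each installed SAPI voice.
        public string Name { get; set; }
    }

    public class Order
    {
        public string Text { get; set; }       // text to synthesize
        public VoiceChoice Voice { get; set; } // requested voice
        public int? Volume { get; set; }       // null falls back to 100
        public int? Rate { get; set; }         // null falls back to 0
        public bool Mp3 { get; set; }          // true: MP3 bytes, false: WAV bytes
    }
    ```

    With types like these, a call could be as simple as File.WriteAllBytes("foo.wav", GetSound(new Order { Text = "Guybrush Threepwood, mighty pirate!", Voice = new VoiceChoice { Name = "Microsoft Zira Desktop" } })), where the voice name is just an example and has to match a voice installed on the machine.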
    

    This is the code that converts the wave data to MP3. It uses NAudio.Lame from NuGet.

    internal static byte[] ConvertToMp3(Stream wave)
    {
        wave.Position = 0;
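        using (var mp3 = new MemoryStream())
        {
            using (var reader = new WaveFileReader(wave))
            using (var writer = new LameMP3FileWriter(mp3, reader.WaveFormat, 128))
            {
                reader.CopyTo(writer);
            } // disposing the writer flushes the final MP3 frames into mp3

            // ToArray() ignores Position and still works if the stream was closed.
            return mp3.ToArray();
        }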
    }
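
    If all you need is a wave file on disk, as in the question, the same COM library can write one directly through SpFileStream, without NAudio. This is a sketch under the same assumption as above (a COM reference to the Microsoft Speech Object Library); the class and method names are made up for illustration, and it has not been checked here against the samples from the question.

    ```csharp
    using System.Runtime.InteropServices;
    using SpeechLib;

    internal static class SapiFileExample // hypothetical name
    {
        internal static void SpeakToWaveFile(string text, string path)
        {
            var voice = new SpVoice();
            var file = new SpFileStream();
            try
            {
                // Same audio format as the in-memory version above.
                file.Format.Type = SpeechAudioFormatType.SAFT22kHz16BitMono;
                file.Open(path, SpeechStreamFileMode.SSFMCreateForWrite, false);

                voice.AudioOutputStream = file;
                // Synchronous speak: returns once all audio has been written.
                voice.Speak(text, SpeechVoiceSpeakFlags.SVSFDefault);

                // Closing the stream finalizes the file on disk.
                file.Close();
            }
            finally
            {
                Marshal.ReleaseComObject(file);
                Marshal.ReleaseComObject(voice);
            }
        }
    }
    ```

    Calling SapiFileExample.SpeakToWaveFile("Guybrush Threepwood, mighty pirate!", @"C:\temp\foo.wav") would then write the clip to that path; the path is only an example.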