c# speech-synthesis

SpeechSynthesizer with a duration property?


I need a speech synthesizer with a settable duration property that specifies how long it should take to speak the text. The System.Speech.Synthesis.SpeechSynthesizer class only has a Rate property.

There's a System.Speech.Synthesis.TtsEngine namespace that has a Prosody class with a settable Duration property, but I can't find any examples of how to use TtsEngine or how this property can be applied to the SpeechSynthesizer class (if that's even possible). Or is there a different speech synthesis library I should look into?


Solution

  • I think I figured it out, thanks to a hint from the first response to this question.

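        using System.Speech.Synthesis;

        SpeechSynthesizer synthesizer = new SpeechSynthesizer();

        // Speaks the text; if duration_millisec > 0, asks the engine to stretch
        // or compress the speech to roughly that many milliseconds.
        void speak_utterance(string utterance_text, int duration_millisec = 0) {

            if (duration_millisec <= 0) {
                // No duration requested: speak at the synthesizer's current Rate.
                synthesizer.Speak(utterance_text);
            }
            else {
                // Wrap the text in an SSML <prosody> element with a duration attribute.
                // The text is inserted as-is, so it must be valid XML content.
                PromptBuilder builder = new PromptBuilder();
                builder.AppendSsmlMarkup("<prosody duration='" + duration_millisec.ToString() + "ms'>" + utterance_text + "</prosody>");
                synthesizer.Speak(builder);
            }
        }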
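
    Because the utterance text is concatenated straight into the SSML string, characters such as & or < would make the markup invalid. A minimal sketch of one way to guard against that, assuming a hypothetical helper named ProsodyWithDuration (not part of System.Speech) whose result would be passed to AppendSsmlMarkup:

    ```csharp
    using System.Globalization;
    using System.Security;

    static class SsmlFragments
    {
        // Hypothetical helper: builds the <prosody> fragment and XML-escapes the text,
        // so that input like "a & b" still yields well-formed SSML.
        public static string ProsodyWithDuration(string text, int durationMillisec)
        {
            string duration = durationMillisec.ToString(CultureInfo.InvariantCulture) + "ms";
            return "<prosody duration=\"" + duration + "\">"
                 + SecurityElement.Escape(text)
                 + "</prosody>";
        }
    }
    ```

    With this, the else branch could call builder.AppendSsmlMarkup(SsmlFragments.ProsodyWithDuration(utterance_text, duration_millisec)) instead of building the string inline.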
    

    I noticed an unexpected interaction between the duration and how the synthesizer speaks numbers. For example:

        string clearance0 = "american one twenty three cleared to land runway one left";
        string clearance1 = "american 123 cleared to land runway one left";
    
        speak_utterance(clearance0, 10000);
        speak_utterance(clearance1, 10000);
    

    For the first call, the whole utterance is uniformly slow and drawn out over 10 seconds.

    For the second call, "american 123" is slow and drawn out as in the first, but the rest of the utterance is spoken at normal speed, giving an overall duration shorter than requested. So I'll have to convert numbers to words to get consistent behavior. (Or maybe there's a property that influences how the synthesizer handles numbers that would correct this. Will update if I find anything.)
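
    The number-to-words workaround could be sketched as below. This is only an illustration under stated assumptions: the helper name SpellOutDigits is hypothetical, and it speaks each digit separately ("one two three") rather than grouping them ("one twenty three"), which may or may not be the phrasing you want. Whether this fully restores the requested duration has not been verified here.

    ```csharp
    using System.Text.RegularExpressions;

    static class UtteranceText
    {
        static readonly string[] DigitWords =
            { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine" };

        // Hypothetical helper: replaces every run of ASCII digits with the digits
        // spoken one at a time, separated by spaces.
        public static string SpellOutDigits(string text)
        {
            return Regex.Replace(text, "[0-9]+", match =>
            {
                var words = new string[match.Value.Length];
                for (int i = 0; i < match.Value.Length; i++)
                    words[i] = DigitWords[match.Value[i] - '0'];
                return string.Join(" ", words);
            });
        }
    }
    ```

    For example, UtteranceText.SpellOutDigits("american 123 cleared to land runway one left") returns "american one two three cleared to land runway one left", which could then be passed to speak_utterance.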