
Getting XDocument to recognize embedded SSML


I am using text-to-speech to add voice to a video tutorial. Currently all the text lives in a file that a C# application reads and parses into multiple steps. I would like to add SSML to the text file, specifically the ability to pause partway through an instruction. I am using the example code from Cognitive-Speech-TTS, which builds the SSML document with a nice clean approach:

private string GenerateSsml(string locale, string gender, string name, string text)
{
    var ssmlDoc = new XDocument(
                      new XElement("speak",
                          new XAttribute("version", "1.0"),
                          new XAttribute(XNamespace.Xml + "lang", "en-US"),
                          new XElement("voice",
                              new XAttribute(XNamespace.Xml + "lang", locale),
                              new XAttribute(XNamespace.Xml + "gender", gender),
                              new XAttribute("name", name),
                              text)));

    return ssmlDoc.ToString();
}

For example, if I set text to:

string text = @"During this video we will refer to this as the lens, 
                <break time=""1000ms"" />  this as the headband  
                 <break time=""1000ms"" />, and these as the frame arms 
                <break time=""1000ms"" />. ";
Content = new StringContent(GenerateSsml(inputOptions.Locale, genderValue, inputOptions.VoiceName, text));

The embedded XML is not recognized. Is there a way to get XDocument to recognize the XML in text? Note that in the actual application, text is populated from a data file.


Solution

  • You're passing in a string, so LINQ to XML thinks you want that to be a text node, escaping the text as appropriate.

    It looks like you really want to include multiple nodes - some text, and some elements.

    I'd suggest changing your GenerateSsml like this:

    private string GenerateSsml(string locale, string gender, string name, IEnumerable<XNode> nodes)
    {
        var ssmlDoc = new XDocument(
                          new XElement("speak",
                              new XAttribute("version", "1.0"),
                              new XAttribute(XNamespace.Xml + "lang", "en-US"),
                              new XElement("voice",
                                  new XAttribute(XNamespace.Xml + "lang", locale),
                                  new XAttribute(XNamespace.Xml + "gender", gender),
                                  new XAttribute("name", name),
                                  nodes)));
        return ssmlDoc.ToString();
    }
    

    Then change your calling method to:

    var nodes = new XNode[]
    {
        new XText("During this video we will refer to this as the lens,"),
        new XElement("break", new XAttribute("time", "1000ms")),
        new XText(" this as the headband"),
        new XElement("break", new XAttribute("time", "1000ms")),
        new XText(", and these as the frame arms"),
        new XElement("break", new XAttribute("time", "1000ms")),
        new XText("."),
    };
    Content = new StringContent(
        GenerateSsml(inputOptions.Locale, genderValue, inputOptions.VoiceName, nodes));
    

    If you really want to use the string representation instead, you could wrap it in a temporary root element and parse it (the text must then be well-formed XML):

    string text = ...; // Code as before
    var element = XElement.Parse($"<root>{text}</root>");
    Content = new StringContent(
        GenerateSsml(inputOptions.Locale, genderValue, inputOptions.VoiceName, element.Nodes()));
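
To make the difference concrete, here is a minimal sketch comparing the two behaviours side by side. It assumes C# 9 or later (top-level statements). `ParseFragment` is a hypothetical helper, not part of the Cognitive-Speech-TTS sample, and its fallback to plain text for malformed input is an assumption about what you might want when the text comes from a data file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

string text = "this as the lens, <break time=\"1000ms\" /> this as the headband.";

// A string argument becomes a single text node, so the markup is escaped.
var escaped = new XElement("voice", text);
Console.WriteLine(escaped.ToString(SaveOptions.DisableFormatting));
Console.WriteLine($"child elements: {escaped.Elements().Count()}");

// Parsed nodes are added as real text and element nodes.
var parsed = new XElement("voice", ParseFragment(text));
Console.WriteLine(parsed.ToString(SaveOptions.DisableFormatting));
Console.WriteLine($"child elements: {parsed.Elements().Count()}");

// Hypothetical helper: wraps the text in a throwaway root element so that a
// mix of text and tags can be parsed, then returns the root's child nodes.
static IEnumerable<XNode> ParseFragment(string text)
{
    try
    {
        return XElement.Parse($"<root>{text}</root>").Nodes();
    }
    catch (XmlException)
    {
        // Assumed fallback: if the text is not well-formed XML (for example a
        // stray '<' or '&'), treat the whole string as plain text.
        return new XNode[] { new XText(text) };
    }
}
```

The result of `ParseFragment` can be passed straight to the `IEnumerable<XNode>` overload of `GenerateSsml` shown above. Whether silently falling back to plain text is appropriate depends on your data; logging or rethrowing may be a better choice if malformed markup should be caught early.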