
Parsing XML for Converting Timed Text Markup (TTML) to WebVTT


I am working on a web application that accepts a subtitle file in either Timed Text Markup Language (TTML) or WebVTT format. If the file is TTML, I want to translate it to WebVTT. This mostly works. The one problem is that if the TTML has HTML as part of the text content, the HTML tags get dropped.

For example:

<p begin="00:00:08.18" dur="00:00:03.86">(Music<br />playing)</p>

results in:

(Musicplaying)

The code I use is:

private const string TIME_FORMAT = "hh\\:mm\\:ss\\.fff";
XmlDocument xmldoc = new XmlDocument();
xmldoc.Load(fileLocation);
XDocument xdoc = xmldoc.ToXDocument();
var ns = (from x in xdoc.Root.DescendantsAndSelf()
          select x.Name.Namespace).First();

List<TTMLElement> elements =
(
     from item in xdoc.Descendants(ns + "body").Descendants(ns + "div").Descendants(ns + "p")
     select new TTMLElement
     {
          text = item.Value,
          startTime = TimeSpan.Parse(item.Attribute("begin").Value),
          duration = TimeSpan.Parse(item.Attribute("dur").Value),
     }
).ToList<TTMLElement>();

StringBuilder sb = new StringBuilder();
sb.AppendLine("WEBVTT");
sb.AppendLine();

for (int i = 0; i < elements.Count; i++)
{
     sb.AppendLine(i.ToString());
     sb.AppendLine(elements[i].startTime.ToString(TIME_FORMAT) + " --> " + elements[i].startTime.Add(elements[i].duration).ToString(TIME_FORMAT));
     sb.AppendLine(elements[i].text);
     sb.AppendLine();
}

Any thoughts on what I'm missing, whether there is a better way of doing this, or whether a solution for converting Timed Text to WebVTT already exists would be appreciated. Thanks.


Solution

  • I finally came back to this project and I also found a solution to my problem.

    First, in this section of the query:

    from item in xdoc.Descendants(ns + "body").Descendants(ns + "div").Descendants(ns + "p")
        select new TTMLElement
        {
            text = item,
            startTime = TimeSpan.Parse(item.Attribute("begin").Value),
            endTime = item.Attribute("dur") != null ?
              TimeSpan.Parse(item.Attribute("begin").Value).Add(TimeSpan.Parse(item.Attribute("dur").Value)) :
              TimeSpan.Parse(item.Attribute("end").Value)
       }
    

    item is of type XElement, so an XmlReader can be created from it. That leads to this function, which returns the element's inner XML with the markup intact:

    private static string ReadInnerXML(XElement parent)
    {
        var reader = parent.CreateReader();
        reader.MoveToContent();
        var innerText = reader.ReadInnerXml();
        return innerText;
    }
    

    Since I wanted to remove the HTML inside the node, I modified the function to replace each tag with a space:

    private static string ReadInnerXML(XElement parent)
    {
        var reader = parent.CreateReader();
        reader.MoveToContent();
        var innerText = reader.ReadInnerXml();
        innerText = Regex.Replace(innerText, "<.+?>", " ");
        return innerText;
    }
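
To make the difference concrete, here is a minimal sketch that runs the sample <p> element from the question through both XElement.Value and the tag-stripping function. The class name InnerXmlDemo and the Main wrapper are only for illustration and are not part of the original code. Note that ReadInnerXml returns markup, so any XML entities in the text (such as &amp;) stay escaped in the result.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class InnerXmlDemo   // hypothetical wrapper class for this sketch
{
    // Same approach as the function above: serialize the child nodes,
    // then replace every tag with a single space.
    public static string ReadInnerXML(XElement parent)
    {
        var reader = parent.CreateReader();
        reader.MoveToContent();                 // position the reader on <p>
        var innerText = reader.ReadInnerXml();  // child markup, tags included
        return Regex.Replace(innerText, "<.+?>", " ");
    }

    public static void Main()
    {
        var p = XElement.Parse(
            "<p begin=\"00:00:08.18\" dur=\"00:00:03.86\">(Music<br />playing)</p>");

        // Value concatenates the text nodes only, so <br /> contributes nothing.
        Console.WriteLine(p.Value);          // prints "(Musicplaying)"

        // The tag is replaced by a space instead of vanishing.
        Console.WriteLine(ReadInnerXML(p));  // prints "(Music playing)"
    }
}
```

If the line break should survive in the WebVTT cue, one variation (an assumption, not something the original code does) would be to replace <br /> tags with a newline before stripping the remaining tags.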
    

    Finally, the query above ends up looking like this:

    from item in xdoc.Descendants(ns + "body").Descendants(ns + "div").Descendants(ns + "p")
        select new TTMLElement
        {
            text = ReadInnerXML(item),
            startTime = TimeSpan.Parse(item.Attribute("begin").Value),
            endTime = item.Attribute("dur") != null ?
              TimeSpan.Parse(item.Attribute("begin").Value).Add(TimeSpan.Parse(item.Attribute("dur").Value)) :
              TimeSpan.Parse(item.Attribute("end").Value)
       }
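
The end-time logic in the query can also be pulled into a small helper, which makes it easier to check against the sample element. This is a rough sketch only: the names CueTimingDemo, EndTime and TimingLine are hypothetical, and parsing with CultureInfo.InvariantCulture is an added assumption so that the decimal point is read the same way on every machine. It only covers clock values that TimeSpan.Parse accepts, such as 00:00:08.18. TTML also allows other time expressions (offsets like 8.18s, or frame counts), and those would need separate handling.

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

public static class CueTimingDemo   // hypothetical helper class for this sketch
{
    // Same format string as TIME_FORMAT in the question.
    private const string TimeFormat = "hh\\:mm\\:ss\\.fff";

    private static TimeSpan ParseTime(string value)
    {
        return TimeSpan.Parse(value, CultureInfo.InvariantCulture);
    }

    // End time is begin + dur when a dur attribute exists,
    // otherwise the value of the end attribute.
    public static TimeSpan EndTime(XElement p)
    {
        var begin = ParseTime(p.Attribute("begin").Value);
        var dur = p.Attribute("dur");
        return dur != null
            ? begin.Add(ParseTime(dur.Value))
            : ParseTime(p.Attribute("end").Value);
    }

    // Builds the WebVTT timing line "start --> end" for one <p> element.
    public static string TimingLine(XElement p)
    {
        var begin = ParseTime(p.Attribute("begin").Value);
        return begin.ToString(TimeFormat) + " --> " + EndTime(p).ToString(TimeFormat);
    }

    public static void Main()
    {
        var p = XElement.Parse(
            "<p begin=\"00:00:08.18\" dur=\"00:00:03.86\">(Music<br />playing)</p>");
        Console.WriteLine(TimingLine(p));   // prints "00:00:08.180 --> 00:00:12.040"
    }
}
```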