Why does JTidy strip out the <video> element from XML?


I am using JTidy to process XHTML documents, and I now have one containing a <video> element, which JTidy strips out. Here is the code:

import org.w3c.dom.Node;
import org.w3c.tidy.Tidy;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import static java.nio.charset.StandardCharsets.UTF_8;

public class Test {
  public static void main (String[] args) throws Exception {
    // Set up a JTidy instance
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF8");

    // The following make no difference to the output
    // whether they are present or not, or whether the
    // parameters are changed from true to false or vice versa
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    tidy.setXHTML(true);
    tidy.setDropEmptyParas(false);
    tidy.setTrimEmptyElements(false);

    // Process XHTML from a string
    String xml = "<div>\n"
               + "  Video goes here:<br/>\n"
               + "  <video width='640' height='480'>\n"
               + "   <source src='foo.mp4' type='video/mp4'/>\n"
               + "  </video>\n"
               + "</div>";

    byte[] bytes = xml.getBytes(UTF_8);
    InputStream in = new ByteArrayInputStream(bytes);
    Node node = tidy.parseDOM(in,null).getDocumentElement();

    // Display the resulting Node as a sanity check
    tidy.pprint(node,System.out);
  }
}

For the example HTML fragment used in the above code, the relevant part of the output is this:

<div>Video goes here:<br />   </div>

I have been told (below) that <video> is an HTML5 tag which is not valid in XHTML (?), so I have tried using tidy.setXHTML(false), and it makes no difference. I have tried adding <!DOCTYPE html> at the start. I have tried removing all the tidy.setXXX() configuration calls. None of these things (in any combination) makes any difference. The only thing that works is to use <embed> instead of <video>, but (a) this is deprecated, (b) I have to replace the <video> tag with <embed> before I parse it, and (c) it doesn't have all the features that <video> does.

So, what can I do to parse a document which contains a video?

Is this an XHTML problem, or just a problem with JTidy? If the latter, is there an alternative I can use?

Or is there a table somewhere of allowed tags for JTidy that I can patch?

And if so, do I need to add all the new HTML5 tags to this table?

Any advice gratefully received...


Solution

  • Good news: I believe you just need to update the version of JTidy you're using.

    The first hit I found when looking for JTidy was the old SourceForge site, where the latest version (r938) was released in 2009. With that version, I can reproduce your problem - so I suspect that's the version you're using.

    However, there's a GitHub repository which is more up to date (last commit in June 2024). The latest release there is version 1.0.5, released in September 2023. With your exact code on that version, the warning goes away and the <video> tag is preserved (interestingly, whether you have setXHTML(true) or setXHTML(false)).

    So basically, update to 1.0.5 and that should fix the problem.
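
If you're not sure which JTidy build your program actually picks up, a quick check along these lines may help. This is only a sketch: the class name JTidyVersionCheck and its helper methods are made up for illustration, and it looks up org.w3c.tidy.Tidy by name so that it compiles even without JTidy on the classpath. The version is read from the jar manifest, which I'm assuming may or may not be filled in for a given JTidy build, so "unknown" is a possible result; the jar location it prints usually gives the version away anyway.

```java
import java.security.CodeSource;

public class JTidyVersionCheck {

    /** Describes a loaded class: manifest version (if recorded) and where it was loaded from. */
    public static String describe(Class<?> cls) {
        Package pkg = cls.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();

        // CodeSource is null for classes loaded by the bootstrap loader (e.g. java.lang.String)
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        String location = (src == null) ? "unknown location" : String.valueOf(src.getLocation());

        return cls.getName()
             + " version=" + (version == null ? "unknown" : version)
             + " from " + location;
    }

    /** Looks the class up by name, so there is no compile-time dependency on it. */
    public static String describeByName(String className) {
        try {
            return describe(Class.forName(className));
        } catch (ClassNotFoundException e) {
            return className + " not found on classpath";
        }
    }

    public static void main(String[] args) {
        // org.w3c.tidy.Tidy is the class imported in the question's code
        System.out.println(describeByName("org.w3c.tidy.Tidy"));
    }
}
```

Run it with the same classpath as your real program. If the printed location points at an old r938 jar, then that jar (possibly pulled in by another dependency) is the one being used, and it is the one to replace.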