Java 13 DocumentBuilder breaks when parsing DTD file to validate HTML


I'm working on a program that uses DocumentBuilder to parse an old HTML file so that it can be processed accordingly. The HTML file contains the following DOCTYPE declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

Here's the code snippet that does the reading:

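import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Default factory: non-validating, but it still loads the external DTD named in the DOCTYPE
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();

// htmlSource is the HTML input; its declaration is not part of this snippet
Document doc = documentBuilder.parse(htmlSource);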

The parsing then fails with the following error:

Error 1:    The declaration for the entity "HTML.Version" must end with '>'.
      Column Number:    3
      System Identifer: null
      toString:         org.xml.sax.SAXParseException; lineNumber: 31; columnNumber: 3; The declaration for the entity "HTML.Version" must end with '>'.
      Line Number:      31
      Public Identifer: null
      Caused By:

      The declaration for the entity "HTML.Version" must end with '>'.
      Trace Follows:

org.xml.sax.SAXParseException; lineNumber: 31; columnNumber: 3; The declaration for the entity "HTML.Version" must end with '>'.
        at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:204)
        at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:178)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1471)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanEntityDecl(XMLDTDScannerImpl.java:1597)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDecls(XMLDTDScannerImpl.java:2021)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDTDExternalSubset(XMLDTDScannerImpl.java:299)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1165)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1040)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:943)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:541)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:246)
        at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at com.rockwellcollins.ana.xml.XmlParser.parse(XmlParser.java:490)
        at com.rockwellcollins.ana.xml.XmlParser.parse(XmlParser.java:592)
        at com.rockwellcollins.qimt.doorsmapper.doorsmapper.HtmlParser.parseHtml(HtmlParser.java:301)
        at com.rockwellcollins.qimt.doorsmapper.doorsmapper.DoorsMapper.applicationSpecificDoIt(DoorsMapper.java:232)
        at com.rockwellcollins.application.common.ApplicationBase.doIt(ApplicationBase.java:795)
        at com.rockwellcollins.qimt.doorsmapper.doorsmapper.DoorsMapper.main(DoorsMapper.java:300)

It's complaining about this section of the DTD file:

<!ENTITY % HTML.Version "-//W3C//DTD HTML 4.01 Transitional//EN"
  -- Typical usage:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
            "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    ...
    </head>
    <body>
    ...
    </body>
    </html>

    The URI used as a system identifier with the public identifier allows
    the user agent to download the DTD and entity sets as needed.

    The FPI for the Strict HTML 4.01 DTD is:

        "-//W3C//DTD HTML 4.01//EN"

    This version of the strict DTD is:

        http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd

    Authors should use the Strict DTD unless they need the
    presentation control for user agents that don't (adequately)
    support style sheets.

    If you are writing a document that includes frames, use 
    the following FPI:

        "-//W3C//DTD HTML 4.01 Frameset//EN"

    This version of the frameset DTD is:

        http://www.w3.org/TR/1999/REC-html401-19991224/frameset.dtd

    Use the following (relative) URIs to refer to 
    the DTDs and entity definitions of this specification:

    "strict.dtd"
    "loose.dtd"
    "frameset.dtd"
    "HTMLlat1.ent"
    "HTMLsymbol.ent"
    "HTMLspecial.ent"

-->

From my initial investigation, it's complaining about the -- comments inside the declarations. If I remove them, the first error disappears and the parser moves on to the next one. My question is: why can't DocumentBuilder read the DTD file correctly?
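
The failure can also be reproduced without downloading loose.dtd. The following is a minimal sketch under stated assumptions: the class name, the root element `r`, and the parameter entity `v` are made up for illustration and do not come from the real DTD. It feeds DocumentBuilder an inline DTD subset once with an SGML-style `-- ... --` comment inside the entity declaration and once with an XML-style `<!-- ... -->` comment placed after it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class EntityCommentRepro {
    // Hypothetical inputs: the entity name "v" and the root element "r" are made up.
    static final String SGML_STYLE =
        "<!DOCTYPE r [ <!ENTITY % v \"x\" -- typical usage -- > ]><r/>";
    static final String XML_STYLE =
        "<!DOCTYPE r [ <!ENTITY % v \"x\"> <!-- typical usage --> ]><r/>";

    // Returns "ok" if the parser accepts the input, otherwise "error: <parser message>".
    static String parseOutcome(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and prints nothing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return "ok";
        } catch (SAXParseException e) {
            return "error: " + e.getMessage();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SGML-style comment: " + parseOutcome(SGML_STYLE));
        System.out.println("XML-style comment:  " + parseOutcome(XML_STYLE));
    }
}
```

With the JDK's built-in parser, the first input should be rejected with the same kind of "must end with '>'" message as in the trace above, while the second should parse. This suggests the problem lies in the declaration syntax rather than in how the DTD is fetched.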

One more detail: we are unable to remove the DTD reference from the HTML file, because the HTML provided is HTML 4 specific and, without it, parsing fails on the HTML 4 formatting.


Solution

  • The HTML 4.01 DTD is an SGML DTD, not an XML DTD. XML is a subset of SGML, so an XML parser cannot read every SGML construct, and HTML can't be parsed using an XML parser. You're right about the comments: unlike XML, SGML allows comments inside a markup declaration, anywhere and more than once. For example, the following is a valid SGML element declaration:

    <!ELEMENT e - O (#PCDATA)
      -- declaration for e --
      -- ... other comment -->
    

    The declaration also hints at one of the features that the XML subset of SGML doesn't support but that is needed for parsing HTML, namely tag inference (tag omission). The - O sequence following the element name e means that the end tag of e may be omitted ("O" as in the letter O) but its start tag may not ("-", hyphen-minus). Other needed features that XML doesn't support are SGML/HTML-style empty elements such as img (written without an end tag) and attribute minimization (as in <div hidden>).
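
The three features named above can be checked against the JDK's XML parser directly. This is a minimal sketch, not a fix: the class name, the helper `parsesAsXml`, and the sample markup are made up for illustration. It only shows whether DocumentBuilder accepts each fragment as well-formed XML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class XmlWellFormednessDemo {
    // Returns true if the markup parses as XML, false if the parser reports a fatal error.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and prints nothing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXParseException e) {
            return false;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<body><p>one<p>two</body>",        // omitted </p> end tags (tag omission)
            "<body><img src=\"a.png\"></body>", // HTML-style empty element, no end tag
            "<body><div hidden></div></body>",  // minimized attribute
            // The same content rewritten as well-formed XML:
            "<body><p>one</p><p>two</p><img src=\"a.png\"/><div hidden=\"hidden\"></div></body>"
        };
        for (String sample : samples) {
            System.out.println(parsesAsXml(sample) + "  " + sample);
        }
    }
}
```

The first three samples should be rejected and the last one accepted, which is why HTML 4 input generally has to be well-formed XML already, or be handled by an SGML or HTML parser, before an XML parser can process it.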