html groovy xmlslurper

How to work around Groovy's XmlSlurper refusing to parse HTML due to DOCTYPE and DTD restrictions?


I'm trying to copy an element in an HTML coverage report, so the coverage totals appear at the top of the report as well as the bottom.

The HTML starts thus and I believe is well-formed:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
    <link rel="stylesheet" href=".resources/report.css" type="text/css" />
    <link rel="shortcut icon" href=".resources/report.gif" type="image/gif" />
    <title>Unified coverage</title>
    <script type="text/javascript" src=".resources/sort.js"></script>
  </head>
  <body onload="initialSort(['breadcrumb', 'coveragetable'])">

Groovy's XmlSlurper complains as follows:

doc = new XmlSlurper( /* false, false, false */ ).parse("index.html")
[Fatal Error] index.html:1:48: DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.

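Enabling DOCTYPE through the third constructor argument (allowDocTypeDeclaration) gets past that error, but parsing then fails on the external DTD, whatever the validating and namespaceAware flags are set to: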

doc = new XmlSlurper(false, false, true).parse("index.html")
[Fatal Error] index.html:1:148: External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

doc = new XmlSlurper(false, true, true).parse("index.html")
[Fatal Error] index.html:1:148: External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.


doc = new XmlSlurper(true, true, true).parse("index.html")
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

doc = new XmlSlurper(true, false, true).parse("index.html")
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

So I think I've covered all the options. There must be a way to get this working without resorting to regexps and risking the wrath of Tony The Pony.


Solution

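  • Tsk. The constructor flag only lifts the DOCTYPE ban; the parser still tries to fetch the external DTD, which is what the second error is about. Set both features on the parser itself:

    def parser = new XmlSlurper()
    // permit the <!DOCTYPE ...> declaration
    parser.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
    // non-validating mode: do not fetch the external DTD at all
    parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
    def doc = parser.parse("index.html")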
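
To check the two features without touching the real report, here is a minimal self-contained sketch. It assumes Groovy 3 or later for the import, and it parses a cut-down inline copy of the document head from the question (the `xhtml` string and the empty body are made up for illustration, not the real report). Nothing is fetched over the network, because the external DTD is never loaded.

```groovy
// Groovy 3+: on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default).
import groovy.xml.XmlSlurper

// Illustrative stand-in for index.html: same XML declaration, DOCTYPE and root element
// as in the question, with a trimmed head and an empty body.
def xhtml = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Unified coverage</title>
  </head>
  <body/>
</html>'''

def parser = new XmlSlurper()
// Allow the DOCTYPE declaration to appear in the input.
parser.setFeature('http://apache.org/xml/features/disallow-doctype-decl', false)
// Tell the non-validating parser not to load the external DTD.
parser.setFeature('http://apache.org/xml/features/nonvalidating/load-external-dtd', false)

def doc = parser.parseText(xhtml)

// GPath navigation works as usual once the document parses.
println doc.head.title.text()
```

For the real file, swap parseText(xhtml) for parse("index.html"); the feature calls stay the same.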