Karate HTML parsing throws SAXParseException when the document begins with lower-case <!doctype

I am trying to run a Karate test that issues a GET on a URL. When the site returns its <!doctype declaration in lower case (perfectly acceptable in 'normal' HTML), Karate's XML parser reports a fatal error and logs a warning. Karate appears to use an XML parser, so strictly speaking this is probably correct behaviour: XML is case-sensitive and only accepts <!DOCTYPE in upper case. However, I cannot find a way around this for valid HTML. I have experimented with different headers and the like, but cannot get past it.

I have included a small test below. Conveniently, google.com also returns a lower-case declaration:

Example Test

Given url 'http://www.google.com'
When method GET
Then status 200

Error

[Fatal Error] :1:3: The markup in the document preceding the root element must be well-formed.
15:19:45.267 [main] WARN com.intuit.karate.FileUtils - parsing failed: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 3; The markup in the document preceding the root element must be well-formed.

<!doctype html><html .... blah

I downloaded the Karate source and found the warning that is reported:

FileUtils.java

public static String toPrettyString(String raw) {
    raw = StringUtils.trimToEmpty(raw);
    try {
        if (Script.isJson(raw)) {
            return JsonUtils.toPrettyJsonString(JsonUtils.toJsonDoc(raw));
        } else if (Script.isXml(raw)) {
            return XmlUtils.toString(XmlUtils.toXmlDoc(raw), true);
        }
    } catch (Exception e) {
        logger.warn("parsing failed: {}", e.getMessage());
    }
    return raw;
}

The choice between JSON and XML appears to be made by checking the first character of the returned document:

Script.java

public static final boolean isXml(String text) {
    return text.startsWith("<");
}

I believe builder.parse is failing because the document is not valid XHTML. The comment that follows implies that the <!doctype would otherwise be removed by the recursive call:

XmlUtils.java

public static Document toXmlDoc(String xml) {
    ...

    Document doc = builder.parse(is);
    if (dtdEntityResolver.dtdPresent) { // DOCTYPE present
        // the XML was not parsed, but I think it hangs at the root as a text node
        // so conversion to string and back has the effect of discarding the DOCTYPE !
        return toXmlDoc(toString(doc, false));

Is it possible to divert this flow for valid HTML?

Solution

If you look at the log, Karate also tells you that it has retained the full response as a string (available in the response variable), even though it failed to "type cast" it to XML. The raw bytes are also available in responseBytes. So it is now up to you to do whatever you want with it; for example, you could in theory find a "lenient" HTML parser and get a DOM tree from it.

Given url 'http://www.google.com'
When method GET
Then status 200
* print response

A couple of hints: you could try a string replace on the response and then attempt to type-cast it to XML. See https://github.com/intuit/karate#type-conversion
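
As a rough illustration of the string-replace idea, here is a Java sketch that removes a leading doctype in any letter case and then hands the remainder to the JDK's standard XML parser. The class DoctypeStripper and its methods are hypothetical names, not part of Karate; something like this could be called from a feature through Karate's Java interop, or the same replace could be done directly on the response string. It assumes the rest of the page is well-formed XML, which real-world HTML frequently is not, and the regex only handles a simple doctype without an internal subset.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Karate.
class DoctypeStripper {

    // Matches a leading doctype declaration regardless of case, e.g. <!doctype html>.
    // Assumption: the declaration contains no '>' before its end (no internal subset).
    private static final Pattern DOCTYPE =
            Pattern.compile("^\\s*<!doctype[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the input without its leading doctype, or unchanged if there is none.
    static String stripDoctype(String html) {
        return DOCTYPE.matcher(html).replaceFirst("");
    }

    // Parses the text with the JDK's XML parser; any failure is rethrown unchecked.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String html = "<!doctype html><html><head><title>Hi</title></head><body/></html>";
        Document doc = parse(stripDoctype(html));
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "html"
    }
}
```

If the page still fails to parse after the doctype is gone, the markup itself is probably not well-formed, and a lenient HTML parser or the regex route is the more realistic option.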

Or, if all you are trying to do is scrape some data out, plain regex matching may do. See these answers:

https://stackoverflow.com/a/53682733/143475

https://stackoverflow.com/a/50372295/143475
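
For the regex route, a minimal Java sketch of pulling one value out of the raw response string might look like the following. TitleScraper and extractTitle are hypothetical names chosen for illustration, and the page title is just an example target; the pattern would need adjusting for whatever data you are actually after.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Karate.
class TitleScraper {

    // DOTALL lets the title span several lines; CASE_INSENSITIVE also accepts <TITLE>.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text of the first <title> element, if there is one.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<!doctype html><html><head><title>Google</title></head><body></body></html>";
        System.out.println(extractTitle(html).orElse("(no title)")); // prints "Google"
    }
}
```

Because this works on the plain string, the lower-case doctype never reaches an XML parser and so does not matter.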