
Karate HTML parsing throwing SaxException when document begins with lower-case <!doctype


I am trying to run a Karate test that issues a GET against a URL. When the site returns its <!doctype declaration in lower case (perfectly acceptable in 'normal' HTML), Karate logs a fatal error and a warning. Karate appears to use an XML parser, so strictly speaking this is probably correct behaviour, since XML only accepts the upper-case <!DOCTYPE. However, I cannot find a way around this for valid HTML. I have tried different headers and similar tweaks, but none of them gets past the error.

I have included a small test; conveniently, google.com returns a lower-case declaration too:

Example Test

Given url 'http://www.google.com'
When method GET
Then status 200

Error

[Fatal Error] :1:3: The markup in the document preceding the root element must be well-formed.
15:19:45.267 [main] WARN com.intuit.karate.FileUtils - parsing failed: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 3; The markup in the document preceding the root element must be well-formed.

<!doctype html><html .... blah

I downloaded the Karate source and found where the reported warning is logged:

FileUtils.java

public static String toPrettyString(String raw) {
    raw = StringUtils.trimToEmpty(raw);
    try {
        if (Script.isJson(raw)) {
            return JsonUtils.toPrettyJsonString(JsonUtils.toJsonDoc(raw));
        } else if (Script.isXml(raw)) {
            return XmlUtils.toString(XmlUtils.toXmlDoc(raw), true);
        }
    } catch (Exception e) {
        logger.warn("parsing failed: {}", e.getMessage());
    }
    return raw;
}

Karate seems to choose between JSON and XML by checking the first character of the returned document:

Script.java

public static final boolean isXml(String text) {
    return text.startsWith("<");
}
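
The behaviour described above can be reproduced outside Karate with the JDK's own XML parser. The following is a minimal sketch, not Karate's code: the class name DoctypeCaseDemo and the helper parsesAsXml are hypothetical, and looksLikeXml only mirrors the shape of the isXml check quoted above. It suggests that a lower-case doctype passes the "starts with <" test but is rejected by the XML parser, while the upper-case form is accepted.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class DoctypeCaseDemo {

    // Same shape as the isXml check quoted from Script.java:
    // anything that starts with '<' is treated as XML.
    public static boolean looksLikeXml(String text) {
        return text.startsWith("<");
    }

    // Hypothetical helper: true if the JDK's default XML parser accepts the text.
    public static boolean parsesAsXml(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXParseException e) {
            // Not well-formed XML, e.g. a lower-case <!doctype declaration.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected for in-memory input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String lower = "<!doctype html><html><body>hi</body></html>";
        String upper = "<!DOCTYPE html><html><body>hi</body></html>";
        System.out.println("lower looks like XML: " + looksLikeXml(lower));
        System.out.println("lower parses as XML:  " + parsesAsXml(lower));
        System.out.println("upper parses as XML:  " + parsesAsXml(upper));
    }
}
```

The default error handler may still print a "[Fatal Error]" line to stderr for the lower-case input, much like the log shown in the question.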

I believe builder.parse is where it fails, because the document is not valid XHTML; the comment that follows the parse call implies that the doctype would otherwise be removed on the recursive call.

XmlUtils.java

public static Document toXmlDoc(String xml) {
    ...

    Document doc = builder.parse(is);
    if (dtdEntityResolver.dtdPresent) { // DOCTYPE present
        // the XML was not parsed, but I think it hangs at the root as a text node
        // so conversion to string and back has the effect of discarding the DOCTYPE !
        return toXmlDoc(toString(doc, false));

Is it possible to divert this flow for valid HTML?


Solution

  • If you look at the log, Karate also tells you that it has retained the full response as a string (available in the response variable), even though it failed to "type cast" it to XML. You also have the raw bytes in responseBytes. So it is up to you what to do next; for example, you could in theory find a "lenient" HTML parser and get a DOM tree from it.

    Given url 'http://www.google.com'
    When method GET
    Then status 200
    * print response
    

    A couple of hints: you could do a string replace on the response and then attempt to type-cast it to XML; refer to https://github.com/intuit/karate#type-conversion

    Or, if all you are trying to do is scrape some data out, plain regex matching may do; refer to these:

    https://stackoverflow.com/a/53682733/143475

    https://stackoverflow.com/a/50372295/143475
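
Both hints can be sketched in plain Java. This is an illustration under assumptions, not part of Karate: the class ResponseHtmlHelper and its three methods are hypothetical names. stripDoctype removes a leading doctype in any letter case, toDocument then attempts a normal XML parse, and extractTitle shows the regex route for pulling out a single value. Note that stripping the doctype only helps when the rest of the page happens to be well-formed XML, which real-world HTML frequently is not.

```java
import java.io.StringReader;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ResponseHtmlHelper {

    // Leading doctype declaration in any letter case (no internal subset assumed).
    private static final Pattern LEADING_DOCTYPE =
            Pattern.compile("^\\s*<!doctype[^>]*>\\s*", Pattern.CASE_INSENSITIVE);

    // First <title> element, matched case-insensitively and across line breaks.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hint 1, step 1: string replace that drops the doctype.
    public static String stripDoctype(String html) {
        return LEADING_DOCTYPE.matcher(html).replaceFirst("");
    }

    // Hint 1, step 2: try to parse what is left as XML.
    public static Document toDocument(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML: " + e.getMessage(), e);
        }
    }

    // Hint 2: scrape one value with a regex instead of building a DOM.
    public static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<!doctype html><html><head><title>Example</title></head><body/></html>";
        Document doc = toDocument(stripDoctype(page));
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(extractTitle(page).orElse("(no title)"));
    }
}
```

From a feature file, a class like this could presumably be reached through Karate's Java interop (Java.type), with the response string passed in as the argument.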