I am trying to run a Karate test that calls GET on a URL. When the site returns its <!doctype declaration in lower case (perfectly acceptable in 'normal' HTML), the Karate XML parser appears to throw a fatal error and a warning.
Karate seems to use an XML parser, so strictly speaking this is probably correct behaviour, since a lower-case doctype is not valid XML.
However, I cannot find a way to get around this for valid HTML. I have tried different headers and so on, but can't get past it.
I have included a small test; luckily, google.com returns a lower-case declaration too:
Example Test
Given url 'http://www.google.com'
When method GET
Then status 200
Error
[Fatal Error] :1:3: The markup in the document preceding the root element must be well-formed.
15:19:45.267 [main] WARN com.intuit.karate.FileUtils - parsing failed: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 3; The markup in the document preceding the root element must be well-formed.
<!doctype html><html .... blah
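The same failure can be reproduced outside Karate. The following is a minimal sketch, assuming the JDK's default DocumentBuilder behaves like the parser Karate uses; the class and method names (DoctypeCaseDemo, parsesAsXml) are made up for illustration and are not part of Karate.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Karate.
class DoctypeCaseDemo {

    // Returns true if the JDK's default XML parser accepts the text.
    static boolean parsesAsXml(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing "[Fatal Error]" to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            // Malformed markup, e.g. a lower-case doctype declaration.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // XML keywords are case-sensitive, so only the upper-case form is accepted.
        System.out.println(parsesAsXml("<!DOCTYPE html><html><body/></html>"));
        System.out.println(parsesAsXml("<!doctype html><html><body/></html>"));
    }
}
```

Note that the sample markup is deliberately well-formed apart from the doctype; a real page such as google.com would probably fail XML parsing for other reasons as well.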
I downloaded the Karate source and found the warning that is reported:
FileUtils.java
public static String toPrettyString(String raw) {
    raw = StringUtils.trimToEmpty(raw);
    try {
        if (Script.isJson(raw)) {
            return JsonUtils.toPrettyJsonString(JsonUtils.toJsonDoc(raw));
        } else if (Script.isXml(raw)) {
            return XmlUtils.toString(XmlUtils.toXmlDoc(raw), true);
        }
    } catch (Exception e) {
        logger.warn("parsing failed: {}", e.getMessage());
    }
    return raw;
}
Whether the response is treated as JSON or XML seems to be decided by checking the first character of the returned document:
Script.java
public static final boolean isXml(String text) {
    return text.startsWith("<");
}
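To illustrate why an HTML page ends up on the XML path, here is a small sketch. The isXml method mirrors the one quoted above; looksLikeHtml is a purely hypothetical helper (it does not exist in Karate) showing one way such a response could be told apart before parsing.

```java
import java.util.Locale;

// Hypothetical demo class, not part of Karate.
class ContentSniffDemo {

    // Same first-character test as the isXml method quoted above.
    static boolean isXml(String text) {
        return text.startsWith("<");
    }

    // Hypothetical helper: guesses HTML from the leading doctype or root tag.
    static boolean looksLikeHtml(String text) {
        String head = text.trim().toLowerCase(Locale.ROOT);
        return head.startsWith("<!doctype html") || head.startsWith("<html");
    }

    public static void main(String[] args) {
        String body = "<!doctype html><html></html>";
        // Both checks match, so a first-character test alone cannot separate HTML from XML.
        System.out.println(isXml(body));
        System.out.println(looksLikeHtml(body));
    }
}
```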
XmlUtils.java
I believe builder.parse is failing because the document is not valid XHTML, so the parse never gets as far as the recursive call which, according to the comment below, would discard the <!doctype.
public static Document toXmlDoc(String xml) {
...
Document doc = builder.parse(is);
if (dtdEntityResolver.dtdPresent) { // DOCTYPE present
// the XML was not parsed, but I think it hangs at the root as a text node
// so conversion to string and back has the effect of discarding the DOCTYPE !
return toXmlDoc(toString(doc, false));
Is it possible to divert this flow for valid HTML?
If you look at the log, Karate also tells you that it has retained the full response (available in the response variable) as a string, even though it failed to "type cast" it to XML. You also have the raw byte array in responseBytes.
So it is now up to you to do anything you want with it; for example, you could in theory find a "lenient" HTML parser and get a DOM tree from it.
Given url 'http://www.google.com'
When method GET
Then status 200
* print response
A couple of hints: you could try a string replace on the response and then attempt to type-cast it to XML; refer to https://github.com/intuit/karate#type-conversion
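The string-replace idea could look roughly like this in plain Java, for example in a helper class called from the feature file. This is only a sketch: DoctypeStripDemo, stripDoctype and toDocument are made-up names, the regex assumes a simple doctype without an internal subset, and the rest of the page must still be well-formed XML for the parse to succeed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Karate.
class DoctypeStripDemo {

    // Removes a leading doctype declaration regardless of its case.
    static String stripDoctype(String raw) {
        return raw.replaceFirst("(?i)^\\s*<!doctype[^>]*>", "");
    }

    // Parses the remaining markup with the JDK's default XML parser.
    static Document toDocument(String raw) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(stripDoctype(raw))));
        } catch (Exception e) {
            // Still not well-formed XML (very likely for real-world HTML).
            throw new IllegalStateException("not well-formed XML: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        Document doc = toDocument("<!doctype html><html><head><title>Hi</title></head></html>");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```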
Or maybe all you are trying to do is scrape some data out, in which case some normal regex matching may do.
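As a rough illustration of the regex route, the sketch below pulls the page title out of the raw response string without any XML parsing. The class and method names (TitleScrapeDemo, extractTitle) are invented for this example, and the pattern is an assumption that only suits simple pages.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Karate.
class TitleScrapeDemo {

    // Case-insensitive, and DOTALL so a title spanning several lines still matches.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or null when no title element is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<!doctype html><html><head><title>Google</title></head></html>";
        System.out.println(extractTitle(html));
    }
}
```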