Tags: r, libxml2, xml2

Getting Memory allocation failed : growing nodeset hit limit with xml2 package


I'm parsing some very big XML files using the xml2 package in R. read_xml() successfully loads the large file, but when I attempt to use xml_find_all(), I get "Error: Memory allocation failed : growing nodeset hit limit." I assume this limit is set within libxml2, perhaps in the XPATH_MAX_NODESET_LENGTH variable, so it may not be an issue with the xml2 package per se. But is a solution possible within xml2? I experimented with removing nodes and freeing memory, with no luck. Thanks.
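
Here's a simplified sketch of what I'm doing; the file name and element name below are placeholders for my actual data:

    library(xml2)

    # Loading the document itself works, even though the file is very large.
    doc <- read_xml("big.xml")                 # placeholder file name

    # This is the call that fails once the result would exceed the limit:
    # Error: Memory allocation failed : growing nodeset hit limit
    records <- xml_find_all(doc, "//record")   # placeholder element name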


Solution

  • Yes, you're hitting the hardcoded nodeset limit of the libxml2 XPath engine. From xpath.c:

    /*
     * XPATH_MAX_NODESET_LENGTH:
     * when evaluating an XPath expression nodesets are created and we
     * arbitrary limit the maximum length of those node set. 10000000 is
     * an insanely large value which should never be reached under normal
     * circumstances, one would first need to construct an in memory tree
     * with more than 10 millions nodes.
     */
    #define XPATH_MAX_NODESET_LENGTH 10000000
    

    One option is to recompile libxml2 with a different value. Or you could change your XPath expressions so that they never encounter nodesets larger than 10M nodes. Note that this limit also applies to intermediate nodesets created during the evaluation of an expression. So, unfortunately, segmenting the nodeset with predicates won't work:

    //element[position() < 5000000]
    //element[position() >= 5000000 and position() < 10000000]
    //element[position() >= 10000000 and position() < 15000000]
    

    In essence, you must make sure that every NodeTest doesn't return more than 10M nodes. If you can't do that, you're sadly out of luck.
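
    Within xml2, one way to keep every evaluation under the limit (if your document's structure allows it) is to select smaller subtrees first and then run the expensive part of the query inside each one, so that no single XPath evaluation builds a nodeset of more than 10M nodes. This is only a rough sketch; /root/group and .//record are placeholders for whatever intermediate container elements your document actually has:

    library(xml2)

    doc <- read_xml("big.xml")                   # placeholder file name

    # Select the intermediate container nodes first; this evaluation is
    # assumed to return far fewer than 10M nodes.
    groups <- xml_find_all(doc, "/root/group")

    # Then evaluate the expensive part inside each container. Each call
    # builds its own, much smaller nodeset, so none of them hits the limit.
    records_per_group <- lapply(groups, xml_find_all, ".//record")

    # Extract what you need per chunk rather than keeping all nodes around.
    texts <- unlist(lapply(records_per_group, xml_text))

    Note that every inner evaluation must still stay under 10M nodes on its own, so this only helps when the target elements are spread across many containers.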

    You could also raise this issue on the libxml2 mailing list. I guess that this limit was introduced to protect against malicious XPath expressions that could lead to denial-of-service attacks. But the way I see it, an expression can never return more nodes than are present in the input documents. So the maximum amount of memory used for a nodeset is limited by the size of the input documents anyway.