python lxml libxml2

What is the maximum size of an XML file when using Python's lxml etree?


In our application we use Python's lxml to parse an XML string held in memory:

parser = etree.XMLParser(... huge_tree=False)
xml = etree.fromstring(src, parser)

I noticed that it bails out when the content of src is more than 10 MB. This is the expected behaviour with huge_tree set to False.

What I can't find information on is: why 10 MB? The documentation says:

huge_tree - disable security restrictions and support very deep trees and very long text content (only affects libxml2 2.7+)

Also, the libxml2 changelog says:

include/libxml/parserInternals.h SAX2.c: add a new define XML_MAX_TEXT_LENGTH limiting the maximum size of a single text node, the default is 10MB and can be removed with the HUGE parsing option

However, I don't understand whether this is hard-coded, or why this choice was made.

The reason I'm asking is that we occasionally deal with input larger than that (when there is a large binary attachment, for example), and I'd like to know whether the limit can be raised to a more reasonable value without disabling it completely.


Solution

  • The value 10000000 is hard-coded in libxml2's parserInternals.h as XML_MAX_TEXT_LENGTH. The limit was introduced shortly after the fix for CVE-2008-4226, an integer overflow that extremely large text nodes could trigger, leading to memory corruption.

    The 10 MB value is arbitrary, which is why there's an option to override it. It seems intended to keep memory-overflow bugs in libxml2 from being exploitable in the wild, by requiring the programmer to explicitly request that the parser be allowed to allocate as much memory as it can address (basically size_t) for a text node.

    That doesn't quite answer why 10 MB, but it probably seemed large enough for ordinary documents while still protecting programmers who just throw XML at the parser without thinking about whether or not to trust the source of the file.
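
    Since the constant is compiled into libxml2, lxml can only switch the limit on or off. One way to approximate a "raised but not removed" limit is to enforce your own cap before parsing and then pass huge_tree=True. The following is only a sketch under assumptions: parse_with_limit and MAX_XML_BYTES are hypothetical names, the 50 MB figure is an arbitrary illustration, src is assumed to be bytes, and resolve_entities=False is an extra precaution I've added because the size check does not bound what entity expansion could produce.

```python
from lxml import etree

# Hypothetical application-level cap; 50 MB is an arbitrary illustration.
MAX_XML_BYTES = 50 * 1000 * 1000


def parse_with_limit(src, max_bytes=MAX_XML_BYTES):
    """Parse src (bytes), rejecting inputs larger than max_bytes."""
    if len(src) > max_bytes:
        raise ValueError(
            "XML input is %d bytes, over the %d byte limit" % (len(src), max_bytes)
        )
    # huge_tree=True lifts libxml2's built-in limits, so the size check
    # above is the only remaining guard on input size.
    # resolve_entities=False is an added precaution (an assumption, not
    # part of the original question): entity references are left unexpanded.
    parser = etree.XMLParser(huge_tree=True, resolve_entities=False)
    return etree.fromstring(src, parser)
```

    Calling parse_with_limit(src) then behaves like the original etree.fromstring call, except that oversized input fails with a ValueError before the parser ever sees it. Note that this caps the whole document rather than a single text node, which is a coarser check than the one libxml2 applies.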