python compression overflow decoding base85

base85 overflow error during decoding of base85 encoded string

I need to embed binary data into XML files, so I've chosen to use base85 encoding for this.

I have a large bytearray that's filled with the output of calls to struct.pack() via bytearray.extend(struct.pack(varying_data)). It then gets compressed with zlib and encoded with base64.b85encode().

This has always worked, but for one particular input file, decoding fails with the following strange error:

ValueError: base85 overflow in hunk starting at byte 582200

I then modified base64.py to print out the value of the current chunk and the bytes it consists of. The input chunk is b'||a|3' and its value is 4,331,076,573, which is bigger than 256^4 = 4,294,967,296 and thus can't be represented by four bytes (that's where the error comes from).
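
The arithmetic can be checked without patching base64.py. The following is only a sketch: the alphabet is copied from CPython's base64 module (worth verifying against your Python version), and group_value is a hypothetical helper, not a standard library function.

```python
import base64

# Alphabet used by base64.b85encode(), copied from CPython's base64 module.
B85_ALPHABET = (b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                b"abcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~")


def group_value(chunk):
    """Interpret a 5-character base85 group as a base-85 number (hypothetical helper)."""
    value = 0
    for char in chunk:  # iterating over bytes yields integers
        value = value * 85 + B85_ALPHABET.index(char)
    return value


value = group_value(b"||a|3")
limit = 256 ** 4  # four bytes can hold values 0 .. limit - 1

# The largest group a correct encoder can emit comes from four 0xFF bytes.
largest_valid = base64.b85encode(b"\xff\xff\xff\xff")

print(value, value >= limit)
print(largest_valid, group_value(largest_valid) == limit - 1)
```

Since the encoder only ever turns four bytes into five characters, a group like b'||a|3' cannot have been produced as one aligned group; it has to come from text that was altered or shifted after encoding.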

But the thing I don't understand is: how can this happen?

This is the important part of the code:

elif isinstance(self.content, (bytes, bytearray)):
    base85 = zlib.compress(self.content, 9)

    # pad=False doesn't make a difference here
    base85 = base64.b85encode(base85, pad=True).decode()

    base85 = escape_xml(base85)

    file.write(base85)

def escape_xml(text):

    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace("\"", "&quot;")
    text = text.replace("'", "&apos;")

    return text

And the code for decoding:

def decode_binary_data(data):
    data = unescape_xml(data)

    # Remove newline for mixed content support (does not apply in this case)
    data = data.split("\n", 1)[0]

    # Error!
    data = base64.b85decode(data)

    return zlib.decompress(data)

def unescape_xml(text):
    text = text.replace("&quot;", "\"")
    text = text.replace("&apos;", "'")
    text = text.replace("&lt;", "<")
    text = text.replace("&gt;", ">")
    text = text.replace("&amp;", "&")

    return text

Base85 can theoretically represent 85^5 = 4,437,053,125 possible combinations per group, but since its input consists of bytes, I'm wondering how a value this large is even possible. Does this come from the compression? That shouldn't be the problem, as encoding and decoding should be symmetrical. If it is the problem, how can I compress the data anyway?

Choosing Ascii85 instead (a85encode()) works, but I don't think this really solves the problem. Maybe it fails in other cases?

Thank you for your help!

Solution

I found the problem! Neither the base85 algorithm nor the compression is the issue here. It is the XML.

For exporting/writing the XML with the included base85 string, I wrote my own class and functions to export XML so that it looks pretty (xml.etree.ElementTree writes everything into one line and for this project I can't use external packages from pip). This is why the base85 string has to get escaped manually.

But for reading the XML files, I use xml.etree.ElementTree. I didn't know that most XML libraries (un)escape strings automatically (which makes sense).

So the problem was the manual unescaping, which ElementTree already does automatically. As a result, the base85 string got unescaped twice. The base85 alphabet contains every character used in the XML escape strings (&amp;, &lt; etc.), so with over 500,000 characters in that base85 string, it can happen that some run of characters in the encoded output forms a valid XML escape string.
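
As a rough sanity check, the sketch below tests whether the five escape strings can occur in base85 and Ascii85 output at all, and estimates how often one shows up in 500,000 characters. It assumes the encoded characters are uniformly random and independent, which is only an approximation for compressed data; all names in it are made up for illustration.

```python
import math

B85_ALPHABET = ("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "abcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~")
# Ascii85 digits are the ASCII characters 33 ('!') through 117 ('u').
A85_ALPHABET = "".join(chr(code) for code in range(33, 118))

ENTITIES = ["&amp;", "&lt;", "&gt;", "&quot;", "&apos;"]


def can_contain(alphabet, entity):
    """True if every character of the entity is a valid digit of the alphabet."""
    return all(char in alphabet for char in entity)


b85_affected = all(can_contain(B85_ALPHABET, e) for e in ENTITIES)
a85_affected = all(can_contain(A85_ALPHABET, e) for e in ENTITIES)

# Expected number of accidental entities in a random 500,000-character string
# (assumption: each character is drawn uniformly from 85 symbols).
length = 500_000
expected_hits = sum((length - len(e) + 1) / 85 ** len(e) for e in ENTITIES)

# Poisson approximation for "at least one entity appears".
probability = 1 - math.exp(-expected_hits)

print(b85_affected, a85_affected)
print(f"chance of at least one accidental entity: {probability:.1%}")
```

Under these assumptions the chance is small for any single file, which fits the observation that only one input file failed. It also suggests that switching to Ascii85 does not remove the risk, because its alphabet contains the same characters.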

And this was the issue. The sequence &lt; was contained in the base85 string itself, and the second unescaping replaced it with <. That shortened the string by three characters and shifted every following 5-character group, which led to this error.
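
The effect can be reproduced in a few lines. This is a minimal sketch, not the project's code: the crafted string is a hand-made base85 text that happens to contain &lt;, the zlib step is left out because it does not matter here, and the <data> element name is made up.

```python
import base64
import xml.etree.ElementTree as ET


def escape_xml(text):
    # Same replacements as the hand-written XML writer, "&" first.
    for char, entity in (("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"),
                         ('"', "&quot;"), ("'", "&apos;")):
        text = text.replace(char, entity)
    return text


def unescape_xml(text):
    # Same replacements as the manual unescaping in the reader.
    for entity, char in (("&quot;", '"'), ("&apos;", "'"), ("&lt;", "<"),
                         ("&gt;", ">"), ("&amp;", "&")):
        text = text.replace(entity, char)
    return text


# Hand-crafted base85 text (three valid groups) containing "&lt;".
crafted = "&lt;0000||a|300"
payload = base64.b85decode(crafted)            # the 12 raw bytes behind it
encoded = base64.b85encode(payload).decode()   # encodes back to the same text

# Write the element by hand, then read it back with ElementTree.
document = "<data>" + escape_xml(encoded) + "</data>"
text = ET.fromstring(document).text            # the parser already unescapes

# Correct: decode the parser's text directly.
restored = base64.b85decode(text)

# Buggy: unescaping again turns "&lt;" into "<" and shifts all later groups.
shifted = unescape_xml(text)
try:
    base64.b85decode(shifted)
    error = None
except ValueError as exc:
    error = str(exc)

print(restored == payload)
print(repr(shifted), error)
```

After the second unescaping, the second group becomes ||a|3, the same kind of group that triggered the overflow in the question. The fix is simply to drop the unescape_xml() call in decode_binary_data() and pass the text returned by ElementTree straight to base64.b85decode().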