I am attempting to decode XML that is spit out by draw.io. According to their documentation, this is "compressed using standard deflate".
I am using the code provided in this question to do the inflation.
import zlib
import base64

def decode_base64_and_inflate(b64string):
    decoded_data = base64.b64decode(b64string)
    # wbits=-15 tells zlib to expect a raw deflate stream (no zlib header)
    return zlib.decompress(decoded_data, -15)
Sample input file:
<mxlibrary>[{"xml":"rVLJboMwEP0aH1t5EUg5Bmhy6ilfQMsULBlMbRNIv77jhSAOSD1UwnjmvdmsN0SU/XI19di96wYUEW9ElEZrF61+KUEpwqlsiKgI5xQP4ZcDlgWWjrWBwf0lgceEe60miAjhucLUopF3JKx7qEjk35MfqvjSg3ux8gfRMwacxgXBjUar9ffZWug/FJi1Hs4QSkb6n7qwDMkFD6NHffiuPjd6Ghrwr6dIz510cBvrT8/OqAJinetRhoqlKW5hiOqE7qjl4KyvkxX4YcuSvgqBFyMZFi+fYJ7vQBbAbB8Y/PT3aFYlLcA4WA71DFAS8wq6B2ceGDLLxnUpIoua0w5k261pNIG1jUD7zN3WA420Iau7bWLgdov6Cw==","w":150,"h":100,"aspect":"fixed"}]</mxlibrary>
I am reading this like so:
from xml.dom import minidom
from urllib.parse import unquote

xmldoc = minidom.parse('samplescratchpad.xml')
buildings = xmldoc.getElementsByTagName('mxlibrary')
# I know eval is bad, but this was being returned as '[...]' instead of
# just a list.
all_buildings = eval(buildings[0].firstChild.nodeValue)
for building in all_buildings:
    print(type(decode_base64_and_inflate(building['xml'])))
    print(decode_base64_and_inflate(building['xml']))
    print(unquote(decode_base64_and_inflate(building['xml'])))
The output of the first two print statements is:
<class 'bytes'>
b'%3CmxGraphModel%3E%3Croot%3E%3CmxCell%20id%3D%220%22%2F%3E%3CmxCell%20id%3D%221%22%20parent%3D%220%22%2F%3E%3CmxCell%20id%3D%222%22%20value%3D%22%26lt%3Bdiv%20style%3D%26quot%3Bfont-size%3A%209px%3B%26quot%3B%26gt%3BAssembler%26lt%3B%2Fdiv%26gt%3B%26lt%3Bdiv%20style%3D%26quot%3Bfont-size%3A%209px%3B%26quot%3B%26gt%3B15%20x%2010%26lt%3B%2Fdiv%26gt%3B%22%20style%3D%22rounded%3D0%3BwhiteSpace%3Dwrap%3Bhtml%3D1%3BfontSize%3D9%3Bpoints%3D%5B%5B0%2C0.33%2C1%5D%2C%5B0%2C0.66%2C1%5D%2C%5B1%2C0.5%2C1%5D%2C%5B0.5%2C0.5%2C0%5D%5D%22%20vertex%3D%221%22%20parent%3D%221%22%3E%3CmxGeometry%20width%3D%22150%22%20height%3D%22100%22%20as%3D%22geometry%22%2F%3E%3C%2FmxCell%3E%3C%2Froot%3E%3C%2FmxGraphModel%3E'
The last print, where I try to convert the above into more standard XML, fails:
  File "test_deflate.py", line 36, in <module>
    print(unquote(decode_base64_and_inflate(building['xml'])))
  File "/usr/lib/python3.5/urllib/parse.py", line 537, in unquote
    if '%' not in string:
TypeError: a bytes-like object is required, not 'str'
How do I fix this so that the bytes object I have (see first two print outputs) works when I attempt to unquote it?
Bonus: Is the eval really needed on my all_buildings = line?
You've actually got the problem backwards, because the error message is misleading.
Your argument is a bytes-like object, but the unquote function evaluates the expression '%' not in string, and '%' is a str, not a bytes-like object, so the membership test fails. Either both operands must be bytes or both must be str.
Python misleadingly tells you to change the first operand ('%') to bytes, but since that's a hard-coded part of the function, that isn't possible. You need to turn the other argument into a str instead.
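The operand-type rule can be checked in isolation. This is a minimal sketch using made-up sample values rather than the draw.io payload:

```python
# Membership tests need matching operand types: str in str, or bytes in bytes.
str_in_str = '%' in 'a%20b'
bytes_in_bytes = b'%' in b'a%20b'
print(str_in_str, bytes_in_bytes)

# Mixing the two reproduces the TypeError from the traceback.
try:
    '%' in b'a%20b'
except TypeError as exc:
    mixed_error = str(exc)
    print(mixed_error)
```

The message names the type that the right-hand bytes object wanted, which is why it reads as if '%' were the thing to fix.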
Try to replace
print(
    unquote(
        decode_base64_and_inflate(building['xml'])
    )
)
with
print(
    unquote(
        decode_base64_and_inflate(
            building['xml']
        ).decode('utf8')
    )
)
This will decode the bytes as a UTF-8-encoded Unicode string (most likely the correct encoding) and yield a str that can be passed to unquote().
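To try the whole chain end to end without the original file, here is a minimal round-trip sketch. The helper encode_like_drawio and the sample XML are hypothetical, written only to produce input of the same shape (percent-encoded, raw deflate, then base64); they are not draw.io's own code. As for the bonus question, the sketch uses json.loads in place of eval, on the assumption that the text inside <mxlibrary> is valid JSON, which the sample in the question appears to be.

```python
import base64
import json
import zlib
from urllib.parse import quote, unquote


def encode_like_drawio(xml_text):
    """Hypothetical helper: percent-encode, raw-deflate, then base64-encode."""
    quoted = quote(xml_text, safe='').encode('ascii')
    compressor = zlib.compressobj(wbits=-15)  # raw deflate, no zlib header
    deflated = compressor.compress(quoted) + compressor.flush()
    return base64.b64encode(deflated).decode('ascii')


def decode_base64_and_inflate(b64string):
    decoded_data = base64.b64decode(b64string)
    return zlib.decompress(decoded_data, -15)


# Made-up sample data standing in for the contents of the mxlibrary node.
sample_xml = '<mxGraphModel><root><mxCell id="0"/></root></mxGraphModel>'
library_text = json.dumps(
    [{"xml": encode_like_drawio(sample_xml), "w": 150, "h": 100}]
)

# json.loads parses the text into a list of dicts without executing it.
all_buildings = json.loads(library_text)
inflated = decode_base64_and_inflate(all_buildings[0]['xml'])  # bytes
result = unquote(inflated.decode('utf8'))  # decode to str, then unquote
print(result)
```

If the real file turns out not to be strict JSON, json.loads will raise an error instead of silently evaluating the text, which is usually the safer failure mode.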
Edit: The reason Python uses this error message is that the in operator is internally a method call on the second operand; that is, the expression a in b is evaluated as b.__contains__(a). The right-hand operand b therefore determines which types the left-hand operand a is allowed to have, not the other way around, which means Python will tell you to change the first operand's type rather than the second one's.
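A small sketch makes that dispatch visible. The class Probe is hypothetical and exists only to record what the in operator hands to it:

```python
class Probe:
    """Hypothetical container that records what `in` passes to it."""

    def __init__(self):
        self.seen = []

    def __contains__(self, item):
        # Invoked as right_operand.__contains__(left_operand).
        self.seen.append(item)
        return False


probe = Probe()
found = '%' in probe  # evaluated as probe.__contains__('%')
print(found, probe.seen)
```

The right-hand object's method receives the left-hand value, so it is the one that decides which types are acceptable. In the question, the right-hand object is a bytes value, and its membership check rejects a str.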