I have to work with legacy code that uses xml.dom.minidom (and I can't migrate to lxml).
I'd like to parse this minimal sample:
<body>
<p>English</p>
<p>Français</p>
</body>
The following function works perfectly:
import codecs
import xml.dom.minidom
def transform1(src_path, dst_path):
    tree = xml.dom.minidom.parse(src_path)
    # ...
    with codecs.open(dst_path, mode="w", encoding="utf-8") as fd:
        tree.writexml(fd, encoding="utf-8")
But if I switch to using io instead, everything goes wrong:
Traceback (most recent call last):
  File "/path/to/minidom_demo.py", line 23, in <module>
    transform2("sample.xml", "result.xml")
  File "/path/to/minidom_demo.py", line 18, in transform2
    tree.writexml(fd, encoding="utf-8")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1747, in writexml
    writer.write('<?xml version="1.0" encoding="%s"?>%s' % (encoding, newl))
TypeError: must be unicode, not str
If I open the file in binary mode (mode="wb"), I get a different exception:
Traceback (most recent call last):
  File "/path/to/minidom_demo.py", line 23, in <module>
    transform2("sample.xml", "result.xml")
  File "/path/to/minidom_demo.py", line 18, in transform2
    tree.writexml(fd, encoding="utf-8")
  ...
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 298, in _write_data
    writer.write(data)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 4: ordinal not in range(128)
The minidom writer seems to be unaware of Unicode.
Why does it work with codecs?
Is there a way to fix that?
The writexml method always seems to write str. According to the documentation, its encoding argument only sets the encoding attribute in the XML header:
Changed in version 2.3: For the Document node, an additional keyword argument encoding can be used to specify the encoding field of the XML header.
You can try this instead:
fd.write(tree.toxml(encoding="utf-8").decode("utf-8"))
The above saves the XML as UTF-8 and also specifies the encoding in the XML header.
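For context, here is a minimal sketch of how that line could fit into a complete function. The name transform2 and its arguments are taken from the traceback in the question; the body is an assumption about what the function looks like, not the asker's actual code. It serializes to UTF-8 bytes with toxml() and decodes them again before handing the result to a text-mode file opened with io.open:

```python
import io
import xml.dom.minidom


def transform2(src_path, dst_path):
    # Hypothetical reconstruction: only the name and arguments come from
    # the traceback in the question.
    tree = xml.dom.minidom.parse(src_path)
    # ...
    # With an encoding, toxml() returns encoded bytes whose XML header
    # declares that encoding.
    xml_bytes = tree.toxml(encoding="utf-8")
    # A text-mode io file only accepts text, so decode before writing;
    # the file object then encodes it back to UTF-8 on disk.
    with io.open(dst_path, mode="w", encoding="utf-8") as fd:
        fd.write(xml_bytes.decode("utf-8"))
```

Since toxml(encoding="utf-8") already yields encoded bytes, opening the file with mode="wb" and writing xml_bytes directly should work as well and avoids the decode/encode round trip.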
If you do not specify encoding, the file will still be saved as UTF-8 (because fd was opened with encoding="utf-8"), but the encoding attribute won't be included in the header.
fd.write(tree.toxml())
If you specify encoding but don't call decode(), it will raise an exception, because toxml() then returns a str rather than unicode, which is rather surprising:
TypeError: write() argument 1 must be unicode, not str
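The difference between the two toxml() calls can be checked directly. The snippet below is a small sketch with a made-up in-memory document (not the asker's file). It only compares the return types and headers, and it is written so that it should behave the same on Python 2.7 and Python 3:

```python
import xml.dom.minidom

# Illustrative document; parseString() is given UTF-8 encoded bytes.
doc = xml.dom.minidom.parseString(
    u"<body><p>Fran\u00e7ais</p></body>".encode("utf-8")
)

as_text = doc.toxml()                    # no encoding: text is returned
as_bytes = doc.toxml(encoding="utf-8")   # with encoding: encoded bytes

text_type = type(u"")  # unicode on Python 2, str on Python 3

print(isinstance(as_text, text_type))    # text, safe for a text-mode file
print(isinstance(as_bytes, bytes))       # bytes (plain str on Python 2)
print("encoding" in as_text)             # header has no encoding attribute
print(b'encoding="utf-8"' in as_bytes)   # header declares the encoding
```

So the text form can be written to an io text file as is, while the bytes form needs either decode() or a file opened in binary mode.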