Can't serialize minidom tree with io module

I have to work with legacy code that uses xml.dom.minidom (and I can't migrate to lxml).

I'd like to parse this minimal sample:

<body>
    <p>English</p>
    <p>Français</p>
</body>

The following function works perfectly:

import codecs
import xml.dom.minidom


def transform1(src_path, dst_path):
    tree = xml.dom.minidom.parse(src_path)
    # ...
    with codecs.open(dst_path, mode="w", encoding="utf-8") as fd:
        tree.writexml(fd, encoding="utf-8")

But if I open the file with the io module instead, writing fails:

Traceback (most recent call last):
  File "/path/to/minidom_demo.py", line 23, in <module>
    transform2("sample.xml", "result.xml")
  File "/path/to/minidom_demo.py", line 18, in transform2
    tree.writexml(fd, encoding="utf-8")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1747, in writexml
    writer.write('<?xml version="1.0" encoding="%s"?>%s' % (encoding, newl))
TypeError: must be unicode, not str
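
For reference, the io-based variant is essentially transform1 with the io module swapped in for codecs. The listing below is a reconstruction from the traceback: the name transform2 and the writexml call appear there, while the io.open arguments are an assumption (taken to mirror transform1).

```python
import io
import xml.dom.minidom


def transform2(src_path, dst_path):
    tree = xml.dom.minidom.parse(src_path)
    # ...
    # Assumption: same mode and encoding as in transform1, only the module differs.
    # On Python 2.7 the writexml call below is where the TypeError is raised.
    with io.open(dst_path, mode="w", encoding="utf-8") as fd:
        tree.writexml(fd, encoding="utf-8")
```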

If I open the file in binary mode (mode="wb"), I get a different exception:

Traceback (most recent call last):
  File "/path/to/minidom_demo.py", line 23, in <module>
    transform2("sample.xml", "result.xml")
  File "/path/to/minidom_demo.py", line 18, in transform2
    tree.writexml(fd, encoding="utf-8")
  ...
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 298, in _write_data
    writer.write(data)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 4: ordinal not in range(128)

The minidom writer seems to be unaware of Unicode.

Why does it work with codecs?

Is there a way to fix that?

Solution

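On Python 2, the writexml method seems to write a mix of types: the XML header is a plain str (see the first traceback), while the text data is unicode (see the second one). A codecs writer accepts both, as long as the str parts are pure ASCII, which is why transform1 works. A file opened in text mode with io accepts only unicode, and one opened in binary mode falls back to the ascii codec for unicode data. Reading the documentation tells me that the encoding argument only sets the encoding field of the XML header: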

Changed in version 2.3: For the Document node, an additional keyword argument encoding can be used to specify the encoding field of the XML header.

You can try instead:

fd.write(tree.toxml(encoding="utf-8").decode("utf-8"))

The above saves the XML as UTF-8 and also declares the encoding in the XML header.
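
Put into a complete function, the one-liner above could look like the following. This is a minimal sketch, not a drop-in replacement: the name transform2 and the sample paths come from the question, and the placement of the decode step is my own choice.

```python
import io
import xml.dom.minidom


def transform2(src_path, dst_path):
    tree = xml.dom.minidom.parse(src_path)
    # ... (modify the tree here)
    # toxml(encoding=...) returns encoded bytes (a str on Python 2) that
    # already contain the XML header. Decode them back to text so that the
    # text-mode file object does the encoding itself.
    text = tree.toxml(encoding="utf-8").decode("utf-8")
    with io.open(dst_path, mode="w", encoding="utf-8") as fd:
        fd.write(text)
```

Called as transform2("sample.xml", "result.xml"), this should write the header followed by the document, with the non-ASCII text encoded as UTF-8.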

If you do not specify encoding, the file will still be saved as UTF-8, but the encoding attribute won't be included in the header:

fd.write(tree.toxml())

If you specify encoding but don't call decode(), write() raises an exception, because toxml() then returns a str rather than unicode (which is quite surprising):

TypeError: write() argument 1 must be unicode, not str
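
Since toxml(encoding="utf-8") already returns encoded bytes, another option is to skip the decode step and write them to a file opened in binary mode. The sketch below assumes that approach suits your code; the function name transform_binary is hypothetical.

```python
import io
import xml.dom.minidom


def transform_binary(src_path, dst_path):
    tree = xml.dom.minidom.parse(src_path)
    # ... (modify the tree here)
    # The result is already UTF-8 encoded (str on Python 2, bytes on
    # Python 3), so it can be written to a binary-mode file as-is.
    data = tree.toxml(encoding="utf-8")
    with io.open(dst_path, mode="wb") as fd:
        fd.write(data)
```

The difference from the text-mode version is only where the encoding happens: here minidom does it, and the file object just stores the bytes.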