Tags: python, python-2.7, io, minidom

Can't serialize minidom tree with io module


I have to work with legacy code that uses xml.dom.minidom (and I can't migrate to lxml).

I'd like to parse this minimal sample:

<body>
    <p>English</p>
    <p>Français</p>
</body>

The following function works perfectly:

import codecs
import xml.dom.minidom


def transform1(src_path, dst_path):
    tree = xml.dom.minidom.parse(src_path)
    # ...
    with codecs.open(dst_path, mode="w", encoding="utf-8") as fd:
        tree.writexml(fd, encoding="utf-8")

But if I open the output file with io.open instead of codecs.open, it fails:

Traceback (most recent call last):
  File "/path/to/minidom_demo.py", line 23, in <module>
    transform2("sample.xml", "result.xml")
  File "/path/to/minidom_demo.py", line 18, in transform2
    tree.writexml(fd, encoding="utf-8")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1747, in writexml
    writer.write('<?xml version="1.0" encoding="%s"?>%s' % (encoding, newl))
TypeError: must be unicode, not str

If I open the file in binary mode (mode="wb"), I get a different exception:

Traceback (most recent call last):
  File "/path/to/minidom_demo.py", line 23, in <module>
    transform2("sample.xml", "result.xml")
  File "/path/to/minidom_demo.py", line 18, in transform2
    tree.writexml(fd, encoding="utf-8")
  ...
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 298, in _write_data
    writer.write(data)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 4: ordinal not in range(128)

The minidom writer seems to be unaware of Unicode.

Why does it work with codecs?

Is there a way to fix that?


Solution

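  • The writexml method does not seem to encode anything itself: it writes plain str pieces (such as the XML header in your first traceback) alongside the unicode data held by the parsed nodes. A file opened with codecs.open accepts both str and unicode, whereas a text-mode io file accepts only unicode, which would explain why only the io version fails. According to the documentation, the encoding argument of writexml only sets the encoding attribute in the XML header.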

    Changed in version 2.3: For the Document node, an additional keyword argument encoding can be used to specify the encoding field of the XML header.

    You can try instead:

    fd.write(tree.toxml(encoding="utf-8").decode("utf-8"))
    

    This saves the XML as UTF-8 and also specifies the encoding in the XML header.

    If you do not specify encoding, it will still save as UTF-8, but the encoding attribute won't be included in the header.

    fd.write(tree.toxml())
    

    If you specify encoding but skip decode(), the write raises an exception, because toxml() then returns an encoded str rather than unicode (which is admittedly surprising):

    TypeError: write() argument 1 must be unicode, not str
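
For completeness, here is a minimal sketch of how the toxml()/decode() approach above could be wrapped in a function that uses io.open. The function name transform_with_io, the temporary directory, and the file names are hypothetical and only there for illustration; the sample document is the one from the question. It is a sketch under those assumptions, not the only way to do this.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile
import xml.dom.minidom


def transform_with_io(src_path, dst_path):
    """Parse src_path with minidom and write it to dst_path as UTF-8.

    Hypothetical helper: the name and signature are illustrative only.
    """
    tree = xml.dom.minidom.parse(src_path)
    # toxml(encoding=...) returns encoded bytes (a str on Python 2.7),
    # so decode it back to text before writing to a text-mode io file.
    text = tree.toxml(encoding="utf-8").decode("utf-8")
    with io.open(dst_path, mode="w", encoding="utf-8") as fd:
        fd.write(text)


# Usage: write the sample from the question to a temporary file,
# transform it, and read the result back as text.
sample = u"<body>\n    <p>English</p>\n    <p>Fran\u00e7ais</p>\n</body>\n"
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "sample.xml")
dst_path = os.path.join(work_dir, "result.xml")

with io.open(src_path, mode="w", encoding="utf-8") as fd:
    fd.write(sample)

transform_with_io(src_path, dst_path)

with io.open(dst_path, mode="r", encoding="utf-8") as fd:
    result = fd.read()
```

The whole document is serialized in memory before it is written, which should be fine for small files like the sample; for very large trees that trade-off may matter.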