Tags: python, xml, file-io, minidom

Writing XML to file corrupts files in python


I'm trying to write the contents of an xml.dom.minidom document to a file. The simplest approach is to use the writexml method:

import codecs
from xml.dom import minidom

def write_xml_native():
    # Building DOM from XML
    xmldoc = minidom.parse('semio2.xml')
    f = codecs.open('codified.xml', mode='w', encoding='utf-8')
    # Using native writexml() method to write
    xmldoc.writexml(f, encoding="utf-8")
    f.close()

The problem is that this corrupts the non-Latin text in the output file. The alternative is to serialize the document to a string and write that string to the file explicitly:

def write_xml():
    # Building DOM from XML
    xmldoc = minidom.parse('semio2.xml')
    # Opening file for writing UTF-8, which is XML's default encoding
    f = codecs.open('codified3.xml', mode='w', encoding='utf-8')
    # Writing XML in UTF-8 encoding, as recommended in the documentation
    f.write(xmldoc.toxml("utf-8"))
    f.close()

This results in the following error:

Traceback (most recent call last):
  File "D:\Projects\Semio\semioparser.py", line 45, in <module>
    write_xml()
  File "D:\Projects\Semio\semioparser.py", line 42, in write_xml
    f.write(xmldoc.toxml(encoding="utf-8"))
  File "C:\Python26\lib\codecs.py", line 686, in write
    return self.writer.write(data)
  File "C:\Python26\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 2064: ordinal not in range(128)

How do I write XML text to a file? What am I missing?

EDIT: The error goes away after adding a decode call: f.write(xmldoc.toxml("utf-8").decode("utf-8")). But the Russian characters are still corrupted.

The text is not corrupted when viewed in the interpreter, only when it is written to the file.
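
A note on where the traceback comes from: in Python 2, toxml("utf-8") returns an encoded byte string, while a file opened with codecs.open expects unicode text, so the writer first decodes the bytes with the default ASCII codec. The sketch below reproduces that decode step explicitly. The sample string and the helper name decode_as_ascii are made up for illustration and are not taken from the original file.

```python
# -*- coding: utf-8 -*-

def decode_as_ascii(data):
    """Mimic the implicit bytes -> text conversion that Python 2 performs."""
    return data.decode("ascii")

# Hypothetical sample text; the letter "Т" becomes the bytes 0xD0 0xA2 in UTF-8.
encoded = u"Тест".encode("utf-8")

try:
    decode_as_ascii(encoded)
    error_message = None
except UnicodeDecodeError as exc:
    # Same kind of failure as in the traceback: byte 0xd0 is not ASCII.
    error_message = str(exc)

# Decoding with the codec that matches the data succeeds.
round_trip = encoded.decode("utf-8")
```

This is why adding .decode("utf-8") removes the exception: the bytes are turned back into text with the right codec before the writer sees them.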


Solution

  • Hmm, though this should work:

    import codecs
    import xml.dom.minidom as minidom

    xml = minidom.parse("test.xml")
    with codecs.open("out.xml", "w", "utf-8") as out:
        xml.writexml(out)
    

    Alternatively, you can try reading the input as UTF-8 yourself and passing the re-encoded bytes to the parser:

    with codecs.open("test.xml", "r", "utf-8") as inp:
        xml = minidom.parseString(inp.read().encode("utf-8"))
    with codecs.open("out.xml", "w", "utf-8") as out:
        xml.writexml(out)
    

    Update: If you build the XML from a unicode string object, you should encode it before passing it to the minidom parser, like this:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    import codecs
    import xml.dom.minidom as minidom
    
    xml = minidom.parseString(u"<ru>Тест</ru>".encode("utf-8"))
    with codecs.open("out.xml", "w", "utf-8") as out:
        xml.writexml(out)
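
Another option that sidesteps the text/bytes mix-up is to let minidom do the encoding and write the resulting bytes to a file opened in binary mode. This is only a minimal sketch, assuming Python 3 (where toxml(encoding=...) returns bytes); the function name write_xml_bytes, the sample document, and the temporary output path are placeholders, not part of the question.

```python
import os
import tempfile
import xml.dom.minidom as minidom


def write_xml_bytes(doc, path, encoding="utf-8"):
    """Serialize a minidom document and write the encoded bytes unchanged."""
    data = doc.toxml(encoding=encoding)  # bytes, including the XML declaration
    with open(path, "wb") as out:        # binary mode: no second encoding step
        out.write(data)


# Hypothetical input document containing Cyrillic text.
doc = minidom.parseString(u"<ru>Тест</ru>".encode("utf-8"))
out_path = os.path.join(tempfile.mkdtemp(), "out.xml")
write_xml_bytes(doc, out_path)

# Read the file back through the parser to check that the text survived.
reparsed = minidom.parse(out_path)
text = reparsed.documentElement.firstChild.data
```

Re-parsing the output is a more reliable check than opening it in an editor, since an editor that guesses a different encoding can make valid UTF-8 look corrupted.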