Python: xml.etree.ElementTree destroys xml format

I have an ISM file (InstallShield project) that is formatted as XML.

I need to change some attributes in the file, so I used xml.etree.ElementTree (Python Library).

I can find the values and change them. However, after saving the file with the updated values, I can't open it in InstallShield (I get a general error that the file can't be opened).

When I compare the old file with the new one, I see that besides the values I changed, some lines are simply missing from the new XML, and in some lines the tag names have changed.

Why does this happen? Is there any way to keep the file exactly as it was, except for the changes I made? Should I use another tool to make the change?

For example, the following section appears in the original XML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="is.xsl" ?>
<!DOCTYPE msi [
   <!ELEMENT msi   (summary,table*)>
   <!ATTLIST msi version    CDATA #REQUIRED>
   <!ATTLIST msi xmlns:dt   CDATA #IMPLIED
                 codepage   CDATA #IMPLIED
                 compression (MSZIP|LZX|none) "LZX">

   <!ELEMENT summary       (codepage?,title?,subject?,author?,keywords?,comments?,

   <!ELEMENT codepage      (#PCDATA)>
   <!ELEMENT title         (#PCDATA)>
   <!ELEMENT subject       (#PCDATA)>
   <!ELEMENT author        (#PCDATA)>
   <!ELEMENT keywords      (#PCDATA)>
   <!ELEMENT comments      (#PCDATA)>
   <!ELEMENT template      (#PCDATA)>
   <!ELEMENT lastauthor    (#PCDATA)>
   <!ELEMENT revnumber     (#PCDATA)>
   <!ELEMENT lastprinted   (#PCDATA)>
   <!ELEMENT createdtm     (#PCDATA)>
   <!ELEMENT lastsavedtm   (#PCDATA)>
   <!ELEMENT pagecount     (#PCDATA)>
   <!ELEMENT wordcount     (#PCDATA)>
   <!ELEMENT charcount     (#PCDATA)>
   <!ELEMENT appname       (#PCDATA)>
   <!ELEMENT security      (#PCDATA)>                            

   <!ELEMENT table         (col+,row*)>
   <!ATTLIST table
                name        CDATA #REQUIRED>

   <!ELEMENT col           (#PCDATA)>
   <!ATTLIST col
                 key       (yes|no) #IMPLIED
                 def       CDATA #IMPLIED>

   <!ELEMENT row            (td+)>

   <!ELEMENT td             (#PCDATA)>
   <!ATTLIST td
                 href       CDATA #IMPLIED
                 dt:dt     (string|bin.base64) #IMPLIED
                 md5        CDATA #IMPLIED>
<msi version="2.0" xmlns:dt="urn:schemas-microsoft-com:datatypes" codepage="65001">

But in the new XML it's gone and instead there is only:

<msi xmlns:ns0="urn:schemas-microsoft-com:datatypes" codepage="65001" version="2.0">

There are more differences; this is just an example.
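The difference described above can be reproduced on a much smaller document. The following is a minimal sketch, assuming a cut-down stand-in for the ISM file (`SAMPLE` and its contents are invented for illustration, not taken from the real project). It suggests that ElementTree does not keep the stylesheet instruction or the DOCTYPE when it serializes the tree, and that it generates its own prefix (`ns0`) unless the namespace is registered first with `register_namespace`. Registering the prefix only addresses the renaming; the missing prolog is not restored by it.

```python
import xml.etree.ElementTree as ET

DT_NS = "urn:schemas-microsoft-com:datatypes"

# Hypothetical, shortened stand-in for the ISM file (not the real project).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="is.xsl" ?>
<!DOCTYPE msi [
   <!ELEMENT msi (summary)>
   <!ELEMENT summary (revnumber)>
   <!ELEMENT revnumber (#PCDATA)>
]>
<msi version="2.0" xmlns:dt="urn:schemas-microsoft-com:datatypes" codepage="65001">
  <summary><revnumber dt:dt="string">1A</revnumber></summary>
</msi>
"""

root = ET.fromstring(SAMPLE)

# Default serialization: the DOCTYPE and the stylesheet instruction are not
# part of the element tree, and the "dt" prefix is replaced by a generated one.
out_default = ET.tostring(root, encoding="unicode")

# register_namespace is process-wide and only affects later serialization.
ET.register_namespace("dt", DT_NS)
out_registered = ET.tostring(root, encoding="unicode")

print(out_default)
print(out_registered)
```

Comparing the two printed strings shows the prefix switching from `ns0` to `dt`, while the DOCTYPE stays absent in both.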

The Python code I use to make the change is:

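    import xml.etree.ElementTree as Et

    tree = Et.parse(ism_file_path)
    root = tree.getroot()

    # Walk the children of each top-level group and update <revnumber>
    for attributes_group in root:
        for attribute in attributes_group:
            if attribute.tag == "revnumber":
                # increment_hex_number is my own helper (not shown here)
                new_package_code = increment_hex_number(attribute.text)
                attribute.text = new_package_code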


Thank you!


  • Eventually I moved to a different library, lxml. Unlike xml.etree.ElementTree, this library keeps the order of all tags, so I did exactly the same thing with it and it worked:

    from lxml import etree

    def modify_ism_file(ism_file_path):
        context = etree.iterparse(ism_file_path)
        # iterparse yields each element once it has been fully parsed
        for action, attributes_group in context:
            for attribute in attributes_group:
                if attribute.tag == "revnumber":
                    print("Found package code. TAG = {0} TEXT = {1}".format(attribute.tag, attribute.text))
                    new_package_code = increment_hex_number(attribute.text)
                    print("New package code is {0}".format(new_package_code))
                    attribute.text = new_package_code
        # tostring returns bytes when an encoding is given, so open in binary mode
        obj_xml = etree.tostring(context.root, pretty_print=True, xml_declaration=True, encoding="utf-8")
        with open(ism_file_path, "wb") as f: