python xml xpath xml-parsing elementtree

How to replace xml node value in Python, without changing the whole file


Taking my first steps in Python, I am trying to parse and update an XML file. The XML is as follows:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>
<!DOCTYPE eu:eu-backbone SYSTEM "../../util/dtd/eu-regional.dtd"[]>
<test dtd-version="3.2" xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink">
  <mr>
    <leaf  checksum="88ed245997a341a4c7d1e40d614eb14f"   >
      <title>book name</title>
    </leaf>
  </mr>
</test>

I would like to update the value of the checksum attribute. I have written a class with one method:

    @staticmethod
    def replace_checksum_in_index_xml(xml_file_path, checksum):
        logging.debug(f"ReplaceChecksumInIndexXml xml_file_path: {xml_file_path}")
        try:
            from xml.etree import ElementTree as et
            tree = et.parse(xml_file_path)
            tree.find('.//leaf').set("checksum", checksum)
            tree.write(xml_file_path)
        except Exception as e:
            logging.error(f"Error updating checksum in {xml_file_path}: {e}")

I call the method:

    xml_file_path = "index.xml"
    checksum = "aaabbb"
    Hashes.replace_checksum_in_index_xml(xml_file_path, checksum)

The checksum is indeed updated, but the rest of the XML structure is changed as well:

<test dtd-version="3.2">
  <mr>
    <leaf checksum="aaabbb">
      <title>book name</title>
    </leaf>
  </mr>
</test>

How can I update only the given node, without changing anything else in the XML file?

UPDATE

The solution provided by LRD_Clan is better than my original one, but it still changes the structure of the XML slightly. Also, when I take a more complex example, part of the XML is again removed. A more complex example with an additional DOCTYPE:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE eu:eu-backbone SYSTEM "../../util/dtd/eu-regional.dtd"[]>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>

<test dtd-version="3.2" xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink">
  <mr>
    <leaf  checksum="88ed245997a341a4c7d1e40d614eb14f"   >
      <title>book name</title>
    </leaf>

  </mr>
</test>

After running the updated script, the result is:

<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?><test xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink" dtd-version="3.2">
  <mr>
    <leaf checksum="aaabbb">
      <title>book name</title>
    </leaf>
  </mr>
</test>

I would really like only this one XML element to be changed, with the rest of the document left intact.

I had a similar solution written in PowerShell, which looks like this:

 [xml]$xmlContent = Get-Content -Path $xmlFilePath
 $element = $xmlContent.SelectSingleNode("//leaf")
 $element.SetAttribute("checksum", "new text")
 $xmlContent.Save((Resolve-Path "$xmlFilePath").Path) 

I was hoping to find something at least as elegant in Python.


Solution

  • In addition to LMC’s answer, a small modification: you can configure the parser to keep comments and processing instructions:

    (Update: the DOCTYPE is added to the XML string manually.)

    from lxml import etree
    
    def replace_checksum(infile, new_value):
        parser = etree.XMLParser(remove_comments=False, remove_pis=False)
        root = etree.parse(infile, parser)
    
        dtd = root.docinfo.doctype + '\n'
    
        for elem in root.xpath("//leaf[@checksum]"):
            elem.set("checksum", new_value)
        
        updated_xml = etree.tostring(root, pretty_print=True, xml_declaration=True, encoding="UTF-8").decode("utf-8")
        
        # add the doctype manually
        i = updated_xml.find('?>\n')
        if len(dtd) > 2:
            updated_doc = updated_xml[:i + len('?>\n')] + dtd + updated_xml[i + len('?>\n'):]
            return updated_doc
        else:
            return updated_xml
    
    if __name__ == "__main__":
        check_sum = "aaabbb"
        outfile = replace_checksum("index.xml", check_sum)
        print(outfile)
    

    Output:

    <?xml version='1.0' encoding='UTF-8'?>
    <!DOCTYPE test SYSTEM "../../util/dtd/eu-regional.dtd">
    <?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>
    <test xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink" dtd-version="3.2">
      <mr>
        <leaf checksum="aaabbb">
          <title>book name</title>
        </leaf>
      </mr>
    </test>
    

    As an alternative, according to the parser documentation:

    The parser option remove_blank_text only removes empty text nodes.

    Comments, processing instructions and the doctype are not affected and remain in the parsed document.

    from lxml import etree
    
    def replace_checksum(infile, checksum_new, outfile):
        parser = etree.XMLParser(remove_blank_text=True)
        with open(infile, "rb") as f:
            tree = etree.parse(f, parser)
        
        for elem in tree.xpath("//leaf[@checksum]"):  
            elem.set("checksum", checksum_new)
        
        with open(outfile, "wb") as f:
            tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=tree.docinfo.doctype)
    
    if __name__ == "__main__":
        infile = "index.xml"
        outfile = "index_new.xml"
        check_sum = "aaabbb"
        replace_checksum(infile, check_sum, outfile)
    

    Resulting file:

    <?xml version='1.0' encoding='UTF-8'?>
    <?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>
    <!DOCTYPE test SYSTEM "../../util/dtd/eu-regional.dtd">
    <test xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink" dtd-version="3.2">
      <mr>
        <leaf checksum="aaabbb">
          <title>book name</title>
        </leaf>
      </mr>
    </test>