python xml xpath xml-parsing elementtree

How to replace an XML node value in Python without changing the rest of the file

I am taking my first steps in Python and trying to parse and update an XML file. The XML is as follows:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>
<!DOCTYPE eu:eu-backbone SYSTEM "../../util/dtd/eu-regional.dtd"[]>
<test dtd-version="3.2" xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink">
  <mr>
    <leaf  checksum="88ed245997a341a4c7d1e40d614eb14f"   >
      <title>book name</title>
    </leaf>
  </mr>
</test>

I would like to update the value of the checksum attribute. I have written a class with one method:

    @staticmethod
    def replace_checksum_in_index_xml(xml_file_path, checksum):
        logging.debug(f"ReplaceChecksumInIndexXml xml_file_path: {xml_file_path}")
        try:
            from xml.etree import ElementTree as et
            tree = et.parse(xml_file_path)
            tree.find('.//leaf').set("checksum", checksum)
            tree.write(xml_file_path)
        except Exception as e:
            logging.error(f"Error updating checksum in {xml_file_path}: {e}")

I call the method:

    xml_file_path = "index.xml"
    checksum = "aaabbb"
    Hashes.replace_checksum_in_index_xml(xml_file_path, checksum)

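The checksum is indeed updated, but the rest of the XML changes as well: the XML declaration, the stylesheet processing instruction, the DOCTYPE and the namespace declarations are gone, and the extra whitespace inside the leaf tag is normalized: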

<test dtd-version="3.2">
  <mr>
    <leaf checksum="aaabbb">
      <title>book name</title>
    </leaf>
  </mr>
</test>

How can I update only the given node without changing anything else in the XML file?

UPDATE

The solution provided by LRD_Clan is better than my original one, but it still changes the structure of the XML slightly. With a more complex example, part of the XML is removed again. Here is the more complex example, with an additional DOCTYPE:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE eu:eu-backbone SYSTEM "../../util/dtd/eu-regional.dtd"[]>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>

<test dtd-version="3.2" xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink">
  <mr>
    <leaf  checksum="88ed245997a341a4c7d1e40d614eb14f"   >
      <title>book name</title>
    </leaf>

  </mr>
</test>

After running the updated script, the result is:

<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?><test xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink" dtd-version="3.2">
  <mr>
    <leaf checksum="aaabbb">
      <title>book name</title>
    </leaf>
  </mr>
</test>

I would really like only this one XML element to change, with the rest of the document left intact.

I had a similar solution written in PowerShell, which looks like this:

 [xml]$xmlContent = Get-Content -Path $xmlFilePath
 $element = $xmlContent.SelectSingleNode("//leaf")
 $element.SetAttribute("checksum", "new text")
 $xmlContent.Save((Resolve-Path "$xmlFilePath").Path)

I was hoping to find something at least as elegant in Python.

Solution

A small modification in addition to LMC’s answer: you can configure the parser to keep comments and processing instructions:

(Update: the DOCTYPE is now added to the XML string manually.)

from lxml import etree

def replace_checksum(infile, new_value):
    # Keep comments and processing instructions when parsing
    parser = etree.XMLParser(remove_comments=False, remove_pis=False)
    tree = etree.parse(infile, parser)

    # docinfo.doctype is an empty string if the file has no DOCTYPE
    doctype = tree.docinfo.doctype

    for elem in tree.xpath("//leaf[@checksum]"):
        elem.set("checksum", new_value)

    updated_xml = etree.tostring(tree, pretty_print=True, xml_declaration=True, encoding="UTF-8").decode("utf-8")

    # Nothing to add if there is no DOCTYPE or lxml already serialized one
    if not doctype or "<!DOCTYPE" in updated_xml:
        return updated_xml

    # add the doctype manually, right after the XML declaration
    end = updated_xml.find('?>\n') + len('?>\n')
    return updated_xml[:end] + doctype + '\n' + updated_xml[end:]

if __name__ == "__main__":
    check_sum = "aaabbb"
    outfile = replace_checksum("index.xml", check_sum)
    print(outfile)

Output:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE test SYSTEM "../../util/dtd/eu-regional.dtd">
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>
<test xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink" dtd-version="3.2">
  <mr>
    <leaf checksum="aaabbb">
      <title>book name</title>
    </leaf>
  </mr>
</test>
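
Comparing this output with the input shows that more than the checksum changed: the quote style in the XML declaration, the attribute order on the root element, the whitespace inside the leaf tag, and the DOCTYPE name (test instead of eu:eu-backbone). To measure this on your own file, here is a small sketch using the standard library's difflib. The helper name changed_lines is made up for illustration and is not part of either answer.

```python
import difflib
from itertools import islice


def changed_lines(original, rewritten):
    """Return only the added ('+') and removed ('-') lines between two texts.

    Hypothetical helper for illustration; it compares line by line.
    """
    diff = difflib.unified_diff(
        original.splitlines(),
        rewritten.splitlines(),
        lineterm="",
        n=0,  # no context lines, so only real changes remain
    )
    # The first two diff lines are the '---' / '+++' file headers; skip them,
    # then drop the '@@ ... @@' hunk markers.
    body = islice(diff, 2, None)
    return [line for line in body if not line.startswith("@@")]


if __name__ == "__main__":
    before = '  <mr>\n    <leaf  checksum="88ed245997a341a4c7d1e40d614eb14f"   >\n  </mr>\n'
    after = '  <mr>\n    <leaf checksum="aaabbb">\n  </mr>\n'
    for line in changed_lines(before, after):
        print(line)
```

If the only lines reported are the old and new leaf lines, nothing else in the file was touched.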

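As an ALTERNATIVE, the DOCTYPE can be passed to tree.write() instead of being inserted by hand. On the parser options, see the lxml parser documentation: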

The parser option remove_blank_text only removes empty text nodes.

Comments, processing instructions and the doctype are not affected and remain in the parsed document.
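
This is easy to check on an in-memory document before touching a file. The following is a minimal sketch, assuming lxml is installed; the sample string, the demo comment and the variable names are illustrative and not taken from the asker's real file.

```python
import io

from lxml import etree

# Illustrative sample; the comment was added only for this demo.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>
<!DOCTYPE test SYSTEM "../../util/dtd/eu-regional.dtd">
<test dtd-version="3.2">
  <!-- demo comment -->
  <mr>
    <leaf checksum="88ed245997a341a4c7d1e40d614eb14f">
      <title>book name</title>
    </leaf>
  </mr>
</test>
"""

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(io.BytesIO(SAMPLE), parser)
root = tree.getroot()

# The processing instruction in front of the root element is still reachable.
pi = root.getprevious()
print(pi.target)

# Comments inside the tree are kept as nodes.
comments = [node.text for node in root.iter(etree.Comment)]
print(comments)

# DOCTYPE details are exposed through docinfo.
print(tree.docinfo.system_url)
print(tree.docinfo.doctype)
```

The function that follows uses the same parser option on a file and hands docinfo.doctype to tree.write():
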

from lxml import etree

def replace_checksum(infile, checksum_new, outfile):
    parser = etree.XMLParser(remove_blank_text=True)
    with open(infile, "rb") as f:
        tree = etree.parse(f, parser)
    
    for elem in tree.xpath("//leaf[@checksum]"):  
        elem.set("checksum", checksum_new)
    
    with open(outfile, "wb") as f:
        tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=tree.docinfo.doctype)

if __name__ == "__main__":
    infile = "index.xml"
    outfile = "index_new.xml"
    check_sum = "aaabbb"
    replace_checksum(infile, check_sum, outfile)

Resulting file:

<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>
<!DOCTYPE test SYSTEM "../../util/dtd/eu-regional.dtd">
<test xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink" dtd-version="3.2">
  <mr>
    <leaf checksum="aaabbb">
      <title>book name</title>
    </leaf>
  </mr>
</test>