I'm taking my first steps in Python and am trying to parse and update an XML file. The XML is as follows:
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>
<!DOCTYPE eu:eu-backbone SYSTEM "../../util/dtd/eu-regional.dtd"[]>
<test dtd-version="3.2" xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink">
<mr>
<leaf checksum="88ed245997a341a4c7d1e40d614eb14f" >
<title>book name</title>
</leaf>
</mr>
</test>
I would like to update the value of the checksum. I have written a class with one method:
@staticmethod
def replace_checksum_in_index_xml(xml_file_path, checksum):
    logging.debug(f"ReplaceChecksumInIndexXml xml_file_path: {xml_file_path}")
    try:
        from xml.etree import ElementTree as et
        tree = et.parse(xml_file_path)
        tree.find('.//leaf').set("checksum", checksum)
        tree.write(xml_file_path)
    except Exception as e:
        logging.error(f"Error updating checksum in {xml_file_path}: {e}")
I call the method:
xml_file_path = "index.xml"
checksum = "aaabbb"
Hashes.replace_checksum_in_index_xml(xml_file_path, checksum)
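The checksum is indeed updated, but the rest of the XML is changed as well: the XML declaration, the stylesheet processing instruction, the DOCTYPE and the namespace declarations are gone: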
<test dtd-version="3.2">
<mr>
<leaf checksum="aaabbb">
<title>book name</title>
</leaf>
</mr>
</test>
How can I update only the given node, without changing anything else in the XML file?
UPDATE
The solution provided by LRD_Clan is better than my original one, but it still changes the structure of the XML slightly. With a more complex example, part of the XML is removed again. Here is the more complex example, with an additional DOCTYPE:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE eu:eu-backbone SYSTEM "../../util/dtd/eu-regional.dtd"[]>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>
<test dtd-version="3.2" xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink">
<mr>
<leaf checksum="88ed245997a341a4c7d1e40d614eb14f" >
<title>book name</title>
</leaf>
</mr>
</test>
After running the updated script, the result is:
<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?><test xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink" dtd-version="3.2">
<mr>
<leaf checksum="aaabbb">
<title>book name</title>
</leaf>
</mr>
</test>
I would really like only this one XML element to be changed and the rest of the document to be left intact.
I had a similar solution written in PowerShell, which looks like this:
[xml]$xmlContent = Get-Content -Path $xmlFilePath
$element = $xmlContent.SelectSingleNode("//leaf")
$element.SetAttribute("checksum", "new text")
$xmlContent.Save((Resolve-Path "$xmlFilePath").Path)
I was hoping to find something at least as elegant in Python.
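For context, the losses shown in the question are what a plain xml.etree.ElementTree round trip produces: the default parser keeps only the element tree, so the DOCTYPE, processing instructions outside the root element, and namespace declarations whose prefixes are never used are not written back. The following is a minimal sketch that reproduces this with the standard library only. The function name `roundtrip_with_elementtree` and the inlined `SAMPLE` document are made up for illustration and are not part of the original code.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, inlined as bytes (illustrative only).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>
<!DOCTYPE eu:eu-backbone SYSTEM "../../util/dtd/eu-regional.dtd"[]>
<test dtd-version="3.2" xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink">
<mr>
<leaf checksum="88ed245997a341a4c7d1e40d614eb14f" >
<title>book name</title>
</leaf>
</mr>
</test>
"""


def roundtrip_with_elementtree(data: bytes, checksum: str) -> str:
    """Parse, update the first <leaf> checksum, and serialize again."""
    root = ET.fromstring(data)
    root.find(".//leaf").set("checksum", checksum)
    # Only the element tree is serialized; prolog items are not part of it.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(roundtrip_with_elementtree(SAMPLE, "aaabbb"))
```

Running it prints the `test` element without the stylesheet instruction, the DOCTYPE, or the two `xmlns:` declarations, which is why the answers below switch to lxml.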
In addition to LMC’s answer, here is a small modification. You can configure the parser to keep comments and processing instructions:
(Update: the DOCTYPE information is added to the XML string manually.)
from lxml import etree


def replace_checksum(infile, new_value):
    parser = etree.XMLParser(remove_comments=False, remove_pis=False)
    root = etree.parse(infile, parser)
    dtd = root.docinfo.doctype + '\n'

    for elem in root.xpath("//leaf[@checksum]"):
        elem.set("checksum", new_value)

    updated_xml = etree.tostring(root, pretty_print=True, xml_declaration=True, encoding="UTF-8").decode("utf-8")

    # add the doctype manually
    i = updated_xml.find('?>\n')
    if len(dtd) > 2:
        updated_doc = updated_xml[:i + len('?>\n')] + dtd + updated_xml[i + len('?>\n'):]
        return updated_doc
    else:
        return updated_xml


if __name__ == "__main__":
    check_sum = "aaabbb"
    outfile = replace_checksum("index.xml", check_sum)
    print(outfile)
Output:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE test SYSTEM "../../util/dtd/eu-regional.dtd">
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>
<test xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink" dtd-version="3.2">
<mr>
<leaf checksum="aaabbb">
<title>book name</title>
</leaf>
</mr>
</test>
As an alternative (see the lxml parser documentation):
The parser option remove_blank_text only removes empty text nodes.
Comments, processing instructions and the doctype are not affected and remain in the parsed document.
from lxml import etree


def replace_checksum(infile, checksum_new, outfile):
    parser = etree.XMLParser(remove_blank_text=True)
    with open(infile, "rb") as f:
        tree = etree.parse(f, parser)

    for elem in tree.xpath("//leaf[@checksum]"):
        elem.set("checksum", checksum_new)

    with open(outfile, "wb") as f:
        tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=tree.docinfo.doctype)


if __name__ == "__main__":
    infile = "index.xml"
    outfile = "index_new.xml"
    check_sum = "aaabbb"
    replace_checksum(infile, check_sum, outfile)
File:
<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet href="util/style/aaaa-2-0.xsl" type="text/xsl"?>
<!DOCTYPE test SYSTEM "../../util/dtd/eu-regional.dtd">
<test xmlns:test="http://www.ich.org/test" xmlns:xlink="http://www.w3c.org/1999/xlink" dtd-version="3.2">
<mr>
<leaf checksum="aaabbb">
<title>book name</title>
</leaf>
</mr>
</test>