I have a couple of XML files that I would like to process to update their nodes and attributes. I have a few example scripts that can do this, but all of them change the XML structure slightly, or remove or alter the DOCTYPE. A simplified example of the XML is:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
<to checksum="abc">Tove</to>
</note:note>
The DTD note.dtd is:
<!ELEMENT note:note (to)>
<!ELEMENT to (#PCDATA)>
<!ATTLIST to
checksum CDATA #REQUIRED
>
An example Python script that updates the attribute value is:
@staticmethod
def replace_checksum_in_index_xml(infile, checksum_new, outfile):
    from lxml import etree
    parser = etree.XMLParser(remove_blank_text=True)
    with open(infile, "rb") as f:
        tree = etree.parse(f, parser)
    for elem in tree.xpath("//to[@checksum]"):
        elem.set("checksum", checksum_new)
    with open(outfile, "wb") as f:
        tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=tree.docinfo.doctype)
I call the script like this:
infile = "Input.xml"
check_sum = "aaabbb"
outfile = "Output.xml"
Hashes.replace_checksum_in_index_xml(infile, check_sum, outfile)
And the result xml file is:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
  <to checksum="aaabbb">Tove</to>
</note:note>
The DOCTYPE in the output has changed. Instead of
DOCTYPE note:note
it now reads
DOCTYPE note
I would like to keep the DOCTYPE as it was.
How can I achieve this in Python?
I don't think the stripping of the prefix in the DOCTYPE is a failure. If you really want to keep the prefix, you can read the DOCTYPE from the input file and write it out explicitly:
from lxml import etree

def read_doctype_name(filename):
    # Return the first line that starts with <!DOCTYPE, or None if there is none.
    with open(filename, "r", encoding="UTF-8") as f:
        for line in f:
            if line.startswith('<!DOCTYPE'):
                return line
    return None

def replace_checksum_in_index_xml(infile, checksum_new, outfile, docName):
    parser = etree.XMLParser(remove_blank_text=True)
    with open(infile, "rb") as f:
        tree = etree.parse(f, parser)
    for elem in tree.xpath("//to[@checksum]"):
        elem.set("checksum", checksum_new)
    with open(outfile, "wb") as f:
        if docName is not None:
            # Write the DOCTYPE exactly as it appeared in the input file.
            tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=docName.strip())
        else:
            tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8")

if __name__ == "__main__":
    doctyp = read_doctype_name("infile.xml")
    replace_checksum_in_index_xml("infile.xml", "aaabbb", "outfile.xml", doctyp)
    print("finish")
Resulting file:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
  <to checksum="aaabbb">Tove</to>
</note:note>
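As a variation that avoids opening the input file a second time, the DOCTYPE can probably be rebuilt from what the parsed tree already knows. The sketch below is only that, a sketch: the helper names build_doctype and doctype_with_prefix are made up for illustration, and it assumes the DOCTYPE name is the same as the root element's qualified name and that the document has no internal DTD subset. It reads root.prefix, root.tag, and the docinfo attributes doctype, public_id and system_url from the lxml tree.

```python
def build_doctype(qualified_name, public_id=None, system_url=None):
    """Assemble a DOCTYPE declaration string from its parts (hypothetical helper)."""
    if public_id:
        # A PUBLIC identifier is always followed by a system literal.
        system = system_url or ""
        return f'<!DOCTYPE {qualified_name} PUBLIC "{public_id}" "{system}">'
    if system_url:
        return f'<!DOCTYPE {qualified_name} SYSTEM "{system_url}">'
    return f'<!DOCTYPE {qualified_name}>'


def doctype_with_prefix(tree):
    """Rebuild the DOCTYPE of a parsed lxml tree, keeping the root's prefix.

    Assumption: the DOCTYPE name equals the root element's qualified name
    and there is no internal subset. Returns None if the input had no DOCTYPE.
    """
    info = tree.docinfo
    if not info.doctype:
        return None
    root = tree.getroot()
    # lxml tags look like "{namespace}local"; keep only the local part.
    local_name = root.tag.rpartition("}")[2]
    qualified = f"{root.prefix}:{local_name}" if root.prefix else local_name
    return build_doctype(qualified, info.public_id, info.system_url)
```

With this, the write call in replace_checksum_in_index_xml could pass doctype=doctype_with_prefix(tree) instead of a string read from the file. If the DOCTYPE name in your files can differ from the root element name, reading it from the file as shown above is the safer choice.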
Alternatively, you can use a regular expression to extract the DOCTYPE string, which also works when the declaration does not start at the beginning of a line:
import re

def read_doctype_name(filename):
    with open(filename, "r", encoding="UTF-8") as f:
        xml_ = f.read()
    # Search once and reuse the match; returns None when there is no DOCTYPE.
    doctype_match = re.search(r'<!DOCTYPE[^>]*>', xml_)
    if doctype_match is not None:
        return doctype_match[0]
    return None
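One caveat with the pattern above: it stops at the first ">" it finds, so a DOCTYPE that carries an internal subset in square brackets would be cut short. A possible extension is sketched below. The name extract_doctype is hypothetical, the function takes the file content as a string, and the pattern assumes that the quoted identifiers contain no ">" and that the internal subset contains no "]" of its own. It is not a full XML parser.

```python
import re

# DOCTYPE with an optional internal subset in square brackets.
# Assumption: no ">" inside the quoted identifiers and no "]" inside the subset.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+[^\[>]*(?:\[.*?\]\s*)?>', re.DOTALL)


def extract_doctype(xml_text):
    """Return the full DOCTYPE declaration found in xml_text, or None."""
    match = DOCTYPE_RE.search(xml_text)
    return match.group(0) if match else None
```

For the example file in the question this returns the same string as the simpler pattern; the difference only shows up when the declaration contains an internal subset.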