How to process and update (change an attribute, add a node, etc.) an XML file with a DOCTYPE in Python, without removing or altering the DOCTYPE

I have a couple of XML files whose nodes and attributes I would like to process and update. I have several example scripts that can do this, but all of them slightly change the XML structure, or remove or alter the DOCTYPE. A simplified example of the XML is:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note  xmlns:note="http://example.com/note">
  <to checksum="abc">Tove</to> 
</note:note>

The DTD note.dtd is:

<!ELEMENT note:note (to)>
<!ELEMENT to (#PCDATA)>
<!ATTLIST to
    checksum CDATA #REQUIRED
>

An example Python script that updates the attribute value:

    @staticmethod
    def replace_checksum_in_index_xml(infile, checksum_new, outfile):
        from lxml import etree
        parser = etree.XMLParser(remove_blank_text=True)
        with open(infile, "rb") as f:
            tree = etree.parse(f, parser)

        for elem in tree.xpath("//to[@checksum]"):
            elem.set("checksum", checksum_new)

        with open(outfile, "wb") as f:
            tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=tree.docinfo.doctype)

I call the script like this:

    infile = "Input.xml"
    check_sum = "aaabbb"
    outfile = "Output.xml"
    Hashes.replace_checksum_in_index_xml(infile, check_sum, outfile)

The resulting XML file is:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
  <to checksum="aaabbb">Tove</to>
</note:note>

The DOCTYPE in the output has changed: instead of
DOCTYPE note:note
it now reads
DOCTYPE note
I would like to keep the DOCTYPE as it was. How can I achieve this in Python?

Solution

I don't think the stripping of the prefix in the DOCTYPE is an error. If you really want to keep the prefix, you can write it explicitly:

from lxml import etree

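def read_doctype_name(filename):
    # Return the first line that starts with a DOCTYPE declaration.
    # Assumes the whole declaration sits on one line of its own.
    with open(filename, "r", encoding="UTF-8") as f:
        for line in f:
            if line.startswith('<!DOCTYPE'):
                return line
    return None  # no DOCTYPE line found
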
def replace_checksum_in_index_xml(infile, checksum_new, outfile, docName):
    parser = etree.XMLParser(remove_blank_text=True)
    with open(infile, "rb") as f:
        tree = etree.parse(f, parser)

    for elem in tree.xpath("//to[@checksum]"):
        elem.set("checksum", checksum_new)

    with open(outfile, "wb") as f:
        if docName is not None:
            tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=docName.strip())
        else:
            tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8")
        
if __name__ == "__main__":
    doctyp = read_doctype_name("infile.xml")
    replace_checksum_in_index_xml("infile.xml", "aaabbb", "outfile.xml", doctyp)
    print("finish")

Output file:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
  <to checksum="aaabbb">Tove</to>
</note:note>
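
If the serializer you use has no `doctype=` argument, another option is to patch the serialized bytes afterwards. This is only a minimal sketch using the standard library: `restore_doctype` is a hypothetical helper name (not part of lxml), and it assumes the DOCTYPE has no internal subset (no `[...]` block) and that the text `<!DOCTYPE` does not appear elsewhere in the document, for example inside a comment or CDATA section.

```python
import re

# Matches a DOCTYPE declaration without an internal subset (assumption).
_DOCTYPE_RE = re.compile(rb"<!DOCTYPE[^>]*>")
# Matches the XML declaration at the very start, plus trailing whitespace.
_XML_DECL_RE = re.compile(rb"^<\?xml[^>]*\?>\s*")


def restore_doctype(serialized: bytes, original_doctype: bytes) -> bytes:
    """Return `serialized` with its DOCTYPE replaced by `original_doctype`.

    If the serialized document has no DOCTYPE at all, the original one is
    inserted after the XML declaration (or at the top if there is none).
    """
    if _DOCTYPE_RE.search(serialized):
        # Replace only the first match; a function is used as replacement
        # so that backslashes in the DOCTYPE are not treated as escapes.
        return _DOCTYPE_RE.sub(lambda m: original_doctype, serialized, count=1)
    decl = _XML_DECL_RE.match(serialized)
    insert_at = decl.end() if decl else 0
    return serialized[:insert_at] + original_doctype + b"\n" + serialized[insert_at:]
```

With lxml, the `serialized` bytes could come from `etree.tostring(tree, ...)`, and the patched result would then be written to the output file in binary mode.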

Alternatively, you can use a regular expression to extract the DOCTYPE string:

import re

def read_doctype_name(filename):
    with open(filename, "r", encoding="UTF-8") as f:
        xml_ = f.read()
    # Search once and reuse the match; returns None if there is no DOCTYPE
    doctype_match = re.search(r'<!DOCTYPE[^>]*>', xml_)
    return doctype_match[0] if doctype_match is not None else None
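
Note that the pattern above stops at the first `>`, so it would cut off a DOCTYPE that carries an internal subset such as `<!DOCTYPE note [ <!ELEMENT ...> ]>`. A slightly more tolerant pattern could look like the sketch below. `extract_doctype` is a hypothetical name, and the pattern still assumes that `]>` does not occur inside a quoted string in the subset and that the system or public identifier contains neither `[` nor `>`.

```python
import re

# DOCTYPE name and external ID, then an optional internal subset in [...].
# re.DOTALL lets the subset span several lines.
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+[^\[>]*(?:\[.*?\]\s*)?>", re.DOTALL)


def extract_doctype(xml_text):
    """Return the full DOCTYPE declaration found in xml_text, or None."""
    match = DOCTYPE_RE.search(xml_text)
    return match.group(0) if match else None


simple = '<?xml version="1.0"?>\n<!DOCTYPE note:note SYSTEM "note.dtd">\n<a/>'
print(extract_doctype(simple))  # <!DOCTYPE note:note SYSTEM "note.dtd">
```

I have not checked how lxml's `doctype=` argument treats a string that includes an internal subset, so test that case before relying on it.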