python, xml, xml-parsing, dtd, doctype

How to process and update (change an attribute, add a node, etc.) an XML file with a DOCTYPE in Python, without removing or altering the DOCTYPE


I have a couple of XML files whose nodes and attributes I would like to process and update. I have a few example scripts that can do that, but all of them slightly change the XML structure, or remove or alter the DOCTYPE. A simplified example of the XML is:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note:note SYSTEM "note.dtd">
<note:note  xmlns:note="http://example.com/note">
  <to checksum="abc">Tove</to> 
</note:note>

The DTD, note.dtd, is:

<!ELEMENT note:note (to)>
<!ELEMENT to (#PCDATA)>
<!ATTLIST to
    checksum CDATA #REQUIRED
>

An example Python script that updates an attribute value is:

    @staticmethod
    def replace_checksum_in_index_xml(infile, checksum_new, outfile):
        from lxml import etree
        parser = etree.XMLParser(remove_blank_text=True)
        with open(infile, "rb") as f:
            tree = etree.parse(f, parser)

        for elem in tree.xpath("//to[@checksum]"):
            elem.set("checksum", checksum_new)

        with open(outfile, "wb") as f:
            tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=tree.docinfo.doctype)

I call the script like this:

    infile = "Input.xml"
    check_sum = "aaabbb"
    outfile = "Output.xml"
    Hashes.replace_checksum_in_index_xml(infile, check_sum, outfile)

And the resulting XML file is:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE note SYSTEM "note.dtd">
<note:note xmlns:note="http://example.com/note">
  <to checksum="aaabbb">Tove</to>
</note:note>

The output DOCTYPE has changed. Instead of
DOCTYPE note:note
there is
DOCTYPE note
I would like to keep the DOCTYPE exactly as it was. How can I achieve this in Python?


Solution

  • I don't think stripping the prefix from the DOCTYPE name is an error. If you really want to keep the prefix, you can read the original DOCTYPE from the input file and write it explicitly:

    from lxml import etree
    
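    def read_doctype_name(filename):
        # Return the first line that starts with a DOCTYPE declaration, or None.
        # Assumes the declaration sits on a single line of its own.
        with open(filename, "r", encoding="UTF-8") as f:
            for line in f:
                if line.startswith('<!DOCTYPE'):
                    return line
        return None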
        
    def replace_checksum_in_index_xml(infile, checksum_new, outfile, docName):
        parser = etree.XMLParser(remove_blank_text=True)
        with open(infile, "rb") as f:
            tree = etree.parse(f, parser)
    
        for elem in tree.xpath("//to[@checksum]"):
            elem.set("checksum", checksum_new)
    
        with open(outfile, "wb") as f:
            if docName is not None:
                tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8", doctype=docName.strip())
            else:
                tree.write(f, pretty_print=True, xml_declaration=True, encoding="UTF-8")
            
    if __name__ == "__main__":
        # Read the original DOCTYPE first, then pass it through to the writer.
        doctyp = read_doctype_name("infile.xml")
        replace_checksum_in_index_xml("infile.xml", "aaabbb", "outfile.xml", doctyp)
        print("finish")
    

    File:

    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE note:note SYSTEM "note.dtd">
    <note:note xmlns:note="http://example.com/note">
      <to checksum="aaabbb">Tove</to>
    </note:note>
    

    Alternatively, you can use a regular expression to extract the DOCTYPE string:

    import re
    
    def read_doctype_name(filename):
        # Read the whole file and return the first DOCTYPE declaration, or None.
        with open(filename, "r", encoding="UTF-8") as f:
            xml_ = f.read()
        doctype_match = re.search(r'<!DOCTYPE[^>]*>', xml_)
        return doctype_match[0] if doctype_match else None
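
The regular expression above stops at the first ">" it finds, so it would cut off a DOCTYPE that carries an internal subset in square brackets. The following is a minimal sketch of a slightly more tolerant pattern, using only the standard library. The names DOCTYPE_RE and extract_doctype are hypothetical, and the pattern assumes that neither the quoted identifiers nor the internal subset contain a stray "[", "]" or ">" that would confuse it. It is not a full XML parser.

```python
import re

# Matches "<!DOCTYPE ...>" and, optionally, an internal subset in [...].
# Assumption: no "]" appears inside the internal subset itself, and no
# "[" or ">" appears inside the quoted public/system identifiers.
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+[^\[>]*(?:\[[^\]]*\]\s*)?>")


def extract_doctype(xml_text):
    """Return the first DOCTYPE declaration in xml_text, or None."""
    match = DOCTYPE_RE.search(xml_text)
    return match.group(0) if match else None


# The sample document from the question.
sample = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!DOCTYPE note:note SYSTEM "note.dtd">\n'
    '<note:note xmlns:note="http://example.com/note">\n'
    '  <to checksum="abc">Tove</to>\n'
    '</note:note>\n'
)

print(extract_doctype(sample))
```

The returned string keeps the prefixed name as written in the input, so it can be passed as the doctype argument of tree.write in the same way as in the script above.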