Search code examples
pythonxmlpython-2.7elementtree

Faithfully Preserve Comments in Parsed XML


I'd like to preserve comments as faithfully as possible while manipulating XML.

I managed to preserve comments, but the contents are getting XML-escaped.

#!/usr/bin/env python
# add_host_to_tomcat.py

import xml.etree.ElementTree as ET
from CommentedTreeBuilder import CommentedTreeBuilder
parser = CommentedTreeBuilder()

if __name__ == '__main__':
    filename = "/opt/lucee/tomcat/conf/server.xml"

    # this is the important part: use the comment-preserving parser
    tree = ET.parse(filename, parser)

    # get the node to add a child to
    engine_node = tree.find("./Service/Engine")

    # add a node: Engine.Host
    host_node = ET.SubElement(
        engine_node,
        "Host",
        name="local.mysite.com",
        appBase="webapps"
    )
    # add a child to new node: Engine.Host.Context
    ET.SubElement(
        host_node,
        'Context',
        path="",
        docBase="/path/to/doc/base"
    )

    tree.write('out.xml')
#!/usr/bin/env python
# CommentedTreeBuilder.py

from xml.etree import ElementTree

class CommentedTreeBuilder ( ElementTree.XMLTreeBuilder ):
    def __init__ ( self, html = 0, target = None ):
        ElementTree.XMLTreeBuilder.__init__( self, html, target )
        self._parser.CommentHandler = self.handle_comment

    def handle_comment ( self, data ):
        self._target.start( ElementTree.Comment, {} )
        self._target.data( data )
        self._target.end( ElementTree.Comment )

However, comments like like:

  <!--
EXAMPLE HOST ENTRY:
    <Host name="lucee.org" appBase="webapps">
         <Context path="" docBase="/var/sites/getrailo.org" />
     <Alias>www.lucee.org</Alias>
     <Alias>my.lucee.org</Alias>
    </Host>

HOST ENTRY TEMPLATE:
    <Host name="[ENTER DOMAIN NAME]" appBase="webapps">
         <Context path="" docBase="[ENTER SYSTEM PATH]" />
     <Alias>[ENTER DOMAIN ALIAS]</Alias>
    </Host>
  -->

End up as:

  <!--
            EXAMPLE HOST ENTRY:
    &lt;Host name="lucee.org" appBase="webapps"&gt;
         &lt;Context path="" docBase="/var/sites/getrailo.org" /&gt;
         &lt;Alias&gt;www.lucee.org&lt;/Alias&gt;
         &lt;Alias&gt;my.lucee.org&lt;/Alias&gt;
    &lt;/Host&gt;

    HOST ENTRY TEMPLATE:
    &lt;Host name="[ENTER DOMAIN NAME]" appBase="webapps"&gt;
         &lt;Context path="" docBase="[ENTER SYSTEM PATH]" /&gt;
         &lt;Alias&gt;[ENTER DOMAIN ALIAS]&lt;/Alias&gt;
    &lt;/Host&gt;
   -->

I also tried self._target.data( saxutils.unescape(data) ) in CommentedTreeBuilder.py, but it didn't seem to do anything. In fact, I think the problem happens somewhere after the handle_commment() step.

By the way, this question is similar to this.


Solution

  • Tested with Python 2.7 and 3.5, the following code should work as intended.

    #!/usr/bin/env python
    # CommentedTreeBuilder.py
    from xml.etree import ElementTree
    
    class CommentedTreeBuilder(ElementTree.TreeBuilder):
        def comment(self, data):
            self.start(ElementTree.Comment, {})
            self.data(data)
            self.end(ElementTree.Comment)
    

    Then, in the main code use

    parser = ElementTree.XMLParser(target=CommentedTreeBuilder())
    

    as the parser instead of the current one.

    By the way, comments work correctly out of the box with lxml. That is, you can just do

    import lxml.etree as ET
    tree = ET.parse(filename)
    

    without needing any of the above.