xmltodict.unparse is not handling CDATA properly


I am trying to use xmltodict to manipulate XML content as a Python object, but I cannot get CDATA handled properly. I think I am missing something. This is my code:

import xmltodict

data = """<node1>
    <node2 id='test'><![CDATA[test]]></node2>
    <node3 id='test'>test</node3>
</node1>"""

data = xmltodict.parse(data, force_cdata=True, encoding='utf-8')
print data

print xmltodict.unparse(data, pretty=True)  

And this is the output:

OrderedDict([(u'node1', OrderedDict([(u'node2', OrderedDict([(u'@id', u'test'), ('#text', u'test')])), (u'node3', OrderedDict([(u'@id', u'test'), ('#text', u'test')]))]))])
<?xml version="1.0" encoding="utf-8"?>
<node1>
        <node2 id="test">test</node2>
        <node3 id="test">test</node3>
</node1>

The output shows that the CDATA section is missing from the generated node2, so node2 and node3 come out identical even though they differ in the input.

Regards


Solution

  • I finally got it working with the monkey-patch below. I am still not very happy with it: it is really a hack, and this feature should be included properly somewhere:

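    import xml.sax.saxutils
    import xmltodict

    # Keep a reference to the original escape() so the patch can fall back to it
    escape_orig = xml.sax.saxutils.escape

    def escape_hacked(data, entities={}):
        # Text that starts with '<' and ends with '>' is emitted as a CDATA section
        if data[0] == '<' and data.strip()[-1] == '>':
            return '<![CDATA[%s]]>' % data

        return escape_orig(data, entities)


    xml.sax.saxutils.escape = escape_hacked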
    

    Then run your Python code as usual:

    data = """<node1>
        <node2 id='test'><![CDATA[test]]></node2>
        <node3 id='test'>test</node3>
    </node1>"""
    
    data = xmltodict.parse(data, force_cdata=True, encoding='utf-8')
    print data
    
    print xmltodict.unparse(data, pretty=True) 
    

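    To explain: the following lines check whether the text starts with '<' and ends with '>' (that is, whether it looks like markup rather than plain text) and, if so, wrap it in a CDATA section instead of escaping it:

        if data[0] == '<' and data.strip()[-1] == '>':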
            return '<![CDATA[%s]]>' % data
    

    Regards