I am trying to use xmltodict to manipulate XML content as a Python object, but I am having trouble handling CDATA properly. I think I am missing something. This is my code:
import xmltodict
data = """<node1>
<node2 id='test'><![CDATA[test]]></node2>
<node3 id='test'>test</node3>
</node1>"""
data = xmltodict.parse(data,force_cdata=True, encoding='utf-8')
print data
print xmltodict.unparse(data, pretty=True)
And this is the output:
OrderedDict([(u'node1', OrderedDict([(u'node2', OrderedDict([(u'@id', u'test'), ('#text', u'test')])), (u'node3', OrderedDict([(u'@id', u'test'), ('#text', u'test')]))]))])
<?xml version="1.0" encoding="utf-8"?>
<node1>
<node2 id="test">test</node2>
<node3 id="test">test</node3>
</node1>
We can see here that the CDATA section is missing from the generated node2, so node2 now looks exactly like node3, although the two nodes differ in the input.
Regards
I finally managed to get it working with this monkey-patch. I am still not very happy with it: it is really a hack, and this feature should be supported properly somewhere:
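import xml.sax.saxutils
import xmltodict
# Keep the original function so the patch can fall back to it
escape_orig = xml.sax.saxutils.escape
def escape_hacked(data, entities={}):
    # Wrap text that looks like markup in CDATA instead of escaping it
    if data[0] == '<' and data.strip()[-1] == '>':
        return '<![CDATA[%s]]>' % data
    return escape_orig(data, entities)
xml.sax.saxutils.escape = escape_hacked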
Then run your Python code as usual:
data = """<node1>
<node2 id='test'><![CDATA[test]]></node2>
<node3 id='test'>test</node3>
</node1>"""
data = xmltodict.parse(data,force_cdata=True, encoding='utf-8')
print data
print xmltodict.unparse(data, pretty=True)
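To explain, the following lines check whether the text looks like markup (it starts with '<' and, ignoring surrounding whitespace, ends with '>') and, if so, wrap it in a CDATA section. Text that does not look like markup, such as a plain 'test', is still escaped as usual: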
if data[0] == '<' and data.strip()[-1] == '>':
    return '<![CDATA[%s]]>' % data
Regards