
How to output CDATA using yattag library


I'm using the following code to generate an XML file that contains a <documents> element.

string = "dasdd Wonder asdf new single, “Tomorrow” #URL# | " \
    "oiojk asfddsf releases new asdfdf, “gfsg” | " \
    "Identity of asfqw who dasd off asdfsdf Mainland jtyjyjui revealed #URL#"

from yattag import Doc, indent
import html, re

doc, tag, text = Doc().tagtext()
with tag('author', lang='en'):
    with tag('documents'):
        for tweet in string.split(' | '):
            with tag('document'):
                tweet = html.unescape(tweet)
                text('<![CDATA[{}]]'.format(tweet))
result = indent(doc.getvalue(), indentation=' ' * 4, newline='\n')
with open('test.xml', 'w', encoding='utf-8') as f:
    f.write(result)

I wanted to wrap the text in a CDATA section, but when I open the generated file in Notepad++, instead of getting this output:

<document><![CDATA[oiojk asfddsf releases new asdfdf, “gfsg”]]></document>

I get this (with the '<' escaped as an HTML entity):

<document>&lt;![CDATA[oiojk asfddsf releases new asdfdf, “gfsg”]]</document>

I tried using the html module (the html.unescape line) to get rid of the HTML entities, but it didn't help.

How can I solve this encoding issue?


Solution

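  • The text method always escapes its argument, so '<' becomes &lt;. If you want no escaping of that kind, use the asis method instead (it inserts the string "as is"); note that the format string in your code ends with ]] rather than ]]>, so the section would also need its closing '>'. In your case, though, it is more appropriate to use Yattag's cdata method.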

    from yattag import Doc
    help(Doc.cdata)
    

    cdata(self, strg, safe=False) appends a CDATA section containing the supplied string.

    You don't have to worry about potential ]]> sequences that would terminate the CDATA section. They are replaced with ]]]]><![CDATA[>.

    If you're sure your string does not contain ]]>, you can pass safe = True. If you do that, your string won't be searched for ]]> sequences.
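To see why that replacement is safe, here is a minimal stdlib-only sketch of the behaviour described above. The helper name cdata_section is hypothetical and this is not yattag's actual source; it only mirrors the documented replacement and then checks, with xml.dom.minidom, that a parser gives back the original string.

```python
import xml.dom.minidom


def cdata_section(s, safe=False):
    """Hypothetical helper: wrap s in a CDATA section.

    Unless safe is True, every ']]>' is split across two CDATA
    sections so it cannot terminate the section early.
    """
    if not safe:
        s = s.replace(']]>', ']]]]><![CDATA[>')
    return '<![CDATA[' + s + ']]>'


payload = 'a ]]> b'
xml_text = '<document>' + cdata_section(payload) + '</document>'

# Parse the result and join the text of all child nodes
# (the split produces two adjacent CDATA sections).
node = xml.dom.minidom.parseString(xml_text).documentElement
recovered = ''.join(child.data for child in node.childNodes)
```

After parsing, recovered equals the original payload, so the split is invisible to whoever reads the XML.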

    So, in your case, you can do:

    for tweet in string.split(' | '):
        with tag('document'):
            tweet = html.unescape(tweet)
            doc.cdata(tweet)