I'm using the following code to generate an XML file that contains <documents> tags.
string = "dasdd Wonder asdf new single, “Tomorrow” #URL# | " \
"oiojk asfddsf releases new asdfdf, “gfsg” | " \
"Identity of asfqw who dasd off asdfsdf Mainland jtyjyjui revealed #URL#"
from yattag import Doc, indent
import html, re
doc, tag, text = Doc().tagtext()
with tag('author', lang='en'):
    with tag('documents'):
        for tweet in string.split(' | '):
            with tag('document'):
                tweet = html.unescape(tweet)
                text('<![CDATA[{}]]'.format(tweet))
result = indent(doc.getvalue(), indentation=' ' * 4, newline='\n')
with open('test.xml', 'w', encoding='utf-8') as f:
    f.write(result)
I wanted to wrap the text in a CDATA section, but when I open the generated file in Notepad++, instead of getting this output:
<document><![CDATA[oiojk asfddsf releases new asdfdf, “gfsg”]]></document>
it appears like this (with HTML entities):
<document>&lt;![CDATA[oiojk asfddsf releases new asdfdf, “gfsg”]]</document>
I tried using the html library (the html.unescape line) to remove the HTML entities, but it didn't work.
How can I solve this encoding issue?
The text method always replaces '<' with &lt;. If you want no escaping of that kind, use the asis method instead (it inserts the string "as is"). In your case, though, it is more appropriate to use Yattag's cdata method.
from yattag import Doc
help(Doc.cdata)
cdata(self, strg, safe=False) appends a CDATA section containing the supplied string.
You don't have to worry about potential ]]> sequences that would terminate
the CDATA section. They are replaced with ]]]]><![CDATA[>.
If you're sure your string does not contain ]]>, you can pass safe = True.
If you do that, your string won't be searched for ]]> sequences.
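The replacement described in that docstring can be checked with the standard library alone. The sketch below uses a hypothetical wrap_cdata helper that applies the same substitution (it is not Yattag's code), then parses the result back to see whether the original string survives.

```python
import xml.etree.ElementTree as ET

def wrap_cdata(value):
    """Hypothetical helper: wrap value in CDATA, splitting any ']]>' inside it."""
    return '<![CDATA[' + value.replace(']]>', ']]]]><![CDATA[>') + ']]>'

tricky = 'before ]]> after'
xml_text = '<document>' + wrap_cdata(tricky) + '</document>'

# The parser joins the two adjacent CDATA sections back into one text value.
parsed = ET.fromstring(xml_text).text

print(xml_text)
print(parsed == tricky)
```

The ]]> sequence is split across two CDATA sections, so neither section is terminated early and the parsed text matches the input.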
So, in your case, you can do:
for tweet in string.split(' | '):
    with tag('document'):
        tweet = html.unescape(tweet)
        doc.cdata(tweet)