I am writing an XML file using lxml and am having issues with control characters. The text I assign to an element is read from a file and contains control characters. When I run the script I receive this error:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
So I wrote a small function to replace the control characters with a '?'. When I look at the generated XML, it appears that the control characters are new lines (0x0A). With this knowledge, I wrote a function to encode these control characters:
def encodeXMLText(text):
    # Replace "&" first so the entities added below are not escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("\"", "&quot;")
    text = text.replace("'", "&apos;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace("\n", "&#xA;")
    text = text.replace("\r", "&#xD;")
    return text
This still returns the same error as before. I want to preserve the new lines so simply stripping them isn't a valid option for me. No idea what I am doing wrong at this point. I am looking for a way to do this with lxml, similar to this:
ruleTitle = ET.SubElement(rule,'title')
ruleTitle.text = encodeXMLText(titleText)
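For reference, a minimal sketch of the assignment without any manual escaping. As far as I can tell, lxml escapes markup characters itself when it serializes, and a plain "\n" is legal in XML text, so it should pass through unchanged. The sample text is made up, the code assumes Python 3, and the stdlib fallback import is only there so the snippet runs where lxml is not installed:

```python
# Sketch only: the sample text is made up for illustration.
try:
    from lxml import etree as ET  # the library used in this question
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback with the same basic API

rule = ET.Element('rule')
ruleTitle = ET.SubElement(rule, 'title')

# Markup characters and the newline are assigned as-is, with no manual escaping.
ruleTitle.text = 'Tom & Jerry <draft>\nsecond line'

# Serialize to a text string; the serializer escapes "&", "<" and ">" here.
xml_text = ET.tostring(rule, encoding='unicode')
print(xml_text)
```

If that holds, running encodeXMLText first would lead to double escaping ("&amp;" becoming "&amp;amp;") rather than fixing the error.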
The other questions I have read either don't use lxml or don't address newline (\n) and carriage return (\r) characters as control characters.
I printed out the string to see which specific characters were causing the issue and noticed the bytes \xe2\x80\x99 in the text, which are the UTF-8 encoding of a right single quotation mark (U+2019). So the issue was the encoding, not the new lines. Decoding the text before assigning it fixed my issue:
ruleTitle = ET.SubElement(rule,'title')
ruleTitle.text = titleText.decode('UTF-8')
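A slightly more defensive version of the same idea, as a rough sketch: decode the bytes as in the fix above, then replace only the control characters that XML 1.0 does not allow, so tabs and new lines are preserved. The helper name to_xml_text and the sample bytes are hypothetical, and the code assumes Python 3 and that the file really is UTF-8:

```python
import re

# C0 control characters that XML 1.0 forbids in text: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
# Other forbidden code points (e.g. U+FFFE, U+FFFF) are not handled here.
_XML_ILLEGAL = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def to_xml_text(raw, encoding='utf-8'):
    """Hypothetical helper: return text that should be safe for element.text."""
    if isinstance(raw, bytes):
        raw = raw.decode(encoding)      # bytes -> text; assumes the given encoding
    return _XML_ILLEGAL.sub('?', raw)   # new lines survive, other controls become '?'

# Made-up sample: a UTF-8 right single quote, a newline and a stray NUL byte.
sample = b'Rule\xe2\x80\x99s title\nline two\x00'
clean = to_xml_text(sample)
print(repr(clean))
```

The result could then be assigned with ruleTitle.text = to_xml_text(titleText).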