Tags: python, xml, python-2.7, lxml, control-characters

Python XML Compatible String


I am writing an XML file using lxml and am having issues with control characters. I read text from a file and assign it to an element, and that text contains control characters. When I run the script I receive this error:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

So I wrote a small function to replace the control characters with a '?'. When I looked at the generated XML, it appeared that the control characters were newlines (0x0A). With this knowledge I wrote a function to encode these control characters:

def encodeXMLText(text):
    text = text.replace("&",  "&amp;")
    text = text.replace("\"", "&quot;")
    text = text.replace("'",  "&apos;")
    text = text.replace("<",  "&lt;")
    text = text.replace(">",  "&gt;")
    text = text.replace("\n", "&#xA;")
    text = text.replace("\r", "&#xD;")
    return text

This still returns the same error as before. I want to preserve the new lines so simply stripping them isn't a valid option for me. No idea what I am doing wrong at this point. I am looking for a way to do this with lxml, similar to this:

  ruleTitle = ET.SubElement(rule,'title')
  ruleTitle.text = encodeXMLText(titleText)
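
For reference, the serializer escapes markup characters in element text on its own and leaves newlines in place, so a helper like encodeXMLText should not be needed; applying it first would double-escape (for example, & would end up as &amp;amp;). A minimal sketch, assuming Python 3 and using the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without third-party packages (the sample title text is made up for illustration):

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

rule = ET.Element('rule')
ruleTitle = ET.SubElement(rule, 'title')

# Assign the raw text directly; no manual entity encoding.
ruleTitle.text = u'Tom & Jerry <draft>\nsecond line'

# The serializer escapes &, < and > and keeps the newline as-is.
xml_out = ET.tostring(rule, encoding='unicode')
print(xml_out)

# Parsing the output gives back the original text, newline included.
round_trip = ET.fromstring(xml_out).find('title').text
```

The same Element, SubElement and tostring calls exist in lxml.etree, so the pattern should carry over.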

The other questions I have read either don't use lxml or don't address newline (\n) and carriage return (\r) characters as control characters.


Solution

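  • I printed out the string to see which specific characters were causing the issue and noticed the bytes \xe2\x80\x99 in the text, which are the UTF-8 encoding of a right single quotation mark (U+2019). So the issue was the encoding, not the newlines: the text was a byte string containing non-ASCII bytes, and lxml only accepts Unicode or plain ASCII. Decoding the text to Unicode before assigning it fixed my issue: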

    ruleTitle = ET.SubElement(rule,'title')
    ruleTitle.text = titleText.decode('UTF-8')
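
    A slightly more defensive variant is sketched below. It is not part of the original fix: to_xml_text is a hypothetical helper name, the code assumes Python 3, and it assumes the file is UTF-8 encoded. It decodes byte strings and then drops only the characters that XML 1.0 forbids, so tabs, newlines and carriage returns are preserved:

```python
import re

# Anything outside the XML 1.0 "Char" production is illegal in a document.
# Tab (0x09), newline (0x0A) and carriage return (0x0D) are legal and kept.
_ILLEGAL_XML_CHARS = re.compile(
    u'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)

def to_xml_text(raw, encoding='utf-8'):
    """Hypothetical helper: return text that is safe to assign to element.text."""
    if isinstance(raw, bytes):
        # Byte strings with non-ASCII content must be decoded first.
        raw = raw.decode(encoding)
    # Remove NUL bytes and other forbidden control characters.
    return _ILLEGAL_XML_CHARS.sub(u'', raw)

# Sample input (made up): a curly apostrophe, a newline and a stray NUL byte.
title_bytes = b'Don\xe2\x80\x99t panic\nsecond line\x00'
title_text = to_xml_text(title_bytes)
```

    The result could then be assigned with ruleTitle.text = to_xml_text(titleText). Whether silently dropping forbidden characters is acceptable depends on the data; replacing them with a visible marker is an equally reasonable choice.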