I'm using Python 3.6.4 with lxml 4.1.1. When reading/parsing the etree, I escape 12 Unicode separator characters. The escape for the paragraph separator (U+2029) looks like this:
line = line.replace('\u2029', ' %(#u2029)s ')
After a lot of filtering/processing, I save the line to a new XML file with this code:
seg = etree.SubElement(tuv, 'seg')
seg.text = line.replace('%(#u2029)s', '\u2029')
This produces the following traceback:
Traceback (most recent call last):
File "C:\process-tmx\", line 267, in run
seg.text = line.replace('%(#u2029)s', '\u2029')
File "src\lxml\xtree.pyx", line 1033, in lxml.etree._Element.text.__set__ (src\lxml\etree.c:55075)
File "src\lxml\apihelpers.pxi", line 716, in lxml.etree._setNodeText (src\lxml\etree.c:25862)
File "src\lxml\apihelpers.pxi", line 704, in lxml.etree._createTextNode (src\lxml\etree.c:25725)
File "src\lxml\apihelpers.pxi", line 1444, in lxml.etree._utf8(src\lxml\etree.c:32944)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
Does this mean that '\u2029' is an XML-incompatible Unicode character? How do I XML-escape it?
Thanks
In the Unicode in XML and other Markup Languages documentation, there's a section called Characters not Suitable for Use in Markup. This section does not actually mandate that U+2029 be illegal in XML, but it says that its use is discouraged.
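For reference, the XML 1.0 specification defines the legal characters through its Char production, and U+2029 falls inside one of the permitted ranges. Here is a minimal sketch of that check (the function name is_xml10_char is made up for illustration):

```python
def is_xml10_char(ch):
    """Return True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # U+2029 lives in this range
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

print(is_xml10_char('\u2029'))  # True
print(is_xml10_char('\x0b'))    # False: a C0 control character
```

Running the same check over the other separator characters that get restored may show whether any of them falls outside these ranges.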
Read the whole section for details, but the short version goes like this:
If you're actually using it as a paragraph separator, you should instead use the paragraph separator for your specific XML language. The example in the documentation is <xhtml:br /> or <xhtml:p></xhtml:p> for XHTML.
If you're just using it as a character in the middle of some non-XML text you're cramming into a field in an XML document, you will want to escape it. How? Well, if you're writing both the creating and consuming code, you can escape it however you want, as long as you can unescape it on the other end. If someone else is writing the consuming code, you have to produce whatever they're expecting. If the consuming code is going to be general-purpose (like displaying raw XML in Firefox), then you'll want it to be something that's readable to the end-user.
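If you control both ends, one way to keep the escaping reversible is to drive both directions from a single table. This is only a sketch: the table name, the function names, and the choice of tokens are assumptions, and the tokens just need to be strings that never occur in the real text.

```python
# Hypothetical placeholder table; extend it with whichever separators you escape.
PLACEHOLDERS = {
    '\u2028': '%(#u2028)s',
    '\u2029': '%(#u2029)s',
}

def escape_separators(text):
    """Swap each separator character for its placeholder token."""
    for char, token in PLACEHOLDERS.items():
        text = text.replace(char, token)
    return text

def unescape_separators(text):
    """Exact inverse of escape_separators."""
    for char, token in PLACEHOLDERS.items():
        text = text.replace(token, char)
    return text

original = 'first paragraph\u2029second paragraph'
escaped = escape_separators(original)
print(escaped)  # first paragraph%(#u2029)ssecond paragraph
print(unescape_separators(escaped) == original)  # True
```

Keeping both directions in one table avoids mismatches, such as padding the token with spaces on the way in but not removing them on the way out.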
For the last case, you might, in fact, want to just use U+2029, despite it being "discouraged". But it looks like lxml won't let you do that, because it's being stricter than necessary. That isn't too unreasonable (you know, strict-in-what-you-produce-liberal-in-what-you-consume and all that), but if you have a use case where it's annoying, it's still annoying. In that case, you need to find a way to override what it does. If there's no config setting, one option is to leave it encoded all the way through lxml and then transform it after lxml is done with it, right before you write it to a file/socket/whatever.
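A rough sketch of that last workaround follows. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without extra dependencies; lxml.etree provides the same Element, SubElement and tostring(..., encoding='unicode') calls. The tuv/seg element names and the placeholder token come from the question, and io.StringIO stands in for the real output file.

```python
import io
import xml.etree.ElementTree as etree  # stand-in; lxml.etree has the same calls

TOKEN = '%(#u2029)s'  # placeholder from the question

tuv = etree.Element('tuv')
seg = etree.SubElement(tuv, 'seg')
seg.text = 'first paragraph' + TOKEN + 'second paragraph'  # still escaped

# Serialize to a str first, then restore the character outside the XML library.
xml_text = etree.tostring(tuv, encoding='unicode')
xml_text = xml_text.replace(TOKEN, '\u2029')

out = io.StringIO()  # stand-in for open(path, 'w', encoding='utf-8')
out.write(xml_text)
```

This only works if the token survives serialization unchanged, so it should not contain &, < or >. If the consumer is an XML parser, replacing the token with the character reference &#x2029; instead of the raw character is another option.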