With Python 2.7 I am reading text as unicode and writing it as utf-16-le. Most characters are written correctly, but some are not, for example u'\u810a', also known as unichr(33034). The following code does not write it correctly:
import codecs
with open('temp.txt','w') as temp:
    temp.write(codecs.BOM_UTF16_LE)
    text = unichr(33034) # text = u'\u810a'
    temp.write(text.encode('utf-16-le'))
However, making either of the following substitutions in the code above makes it work:
unichr(33033) and unichr(33035) work correctly.
'utf-8' encoding (without BOM, byte-order mark).
How can I recognize characters that won't write correctly, and how can I write a 'utf-16-le' encoded file with BOM that either prints these characters or some replacement?
You are opening the file in text mode, which means that line-break bytes are translated to the local convention (on Windows, for example, each 0A byte is written out as 0D 0A). Unfortunately, the character you are trying to write, u'\u810a', encodes in UTF-16-LE to the two bytes 0A 81. The 0A byte is interpreted as a line break, so the character does not make it to the file correctly.
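As for recognizing which characters are affected: any character whose UTF-16-LE encoding contains a 0A byte is at risk in text mode. Here is a minimal sketch of such a check; the helper name has_newline_byte is made up for illustration and is not part of any library.

```python
# -*- coding: utf-8 -*-

def has_newline_byte(text, encoding='utf-16-le'):
    """Return True if the encoded form of `text` contains a 0x0A byte.

    Hypothetical helper: such bytes are the ones that text-mode
    line-break translation can alter when the file is written.
    """
    return b'\x0a' in text.encode(encoding)


# u'\u810a' encodes to the bytes 0A 81, so it is affected.
affected = has_newline_byte(u'\u810a')

# Its neighbours encode to 09 81 and 0B 81, so they are not.
neighbours_affected = [has_newline_byte(c) for c in (u'\u8109', u'\u810b')]
```

This also explains why unichr(33033) and unichr(33035) work: neither of their encodings contains 0A. The same applies to characters whose high byte is 0A, such as u'\u0a00'.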
Open the file in binary mode instead:
open('temp.txt','wb')
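Putting it together, the original snippet with binary mode might look like the sketch below. The function name write_utf16_le and the temporary file path are assumptions made for the example; the only change that matters is 'wb' in place of 'w'.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def write_utf16_le(path, text):
    """Write `text` to `path` as UTF-16-LE, preceded by a BOM.

    Hypothetical helper. Binary mode ('wb') disables line-break
    translation, so every encoded byte is written unchanged.
    """
    with open(path, 'wb') as out:
        out.write(codecs.BOM_UTF16_LE)
        out.write(text.encode('utf-16-le'))


# Demonstration: write the problem character, then read the raw bytes back.
demo_path = os.path.join(tempfile.mkdtemp(), 'temp.txt')
write_utf16_le(demo_path, u'\u810a')
with open(demo_path, 'rb') as check:
    raw = check.read()
```

Reading the file back in binary mode is a quick way to confirm that the bytes on disk are exactly the BOM followed by the encoded text.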