Tags: python, unicode, utf-8, utf-16, utf-16le

Writing unicode with python - what is wrong with this character


With Python 2.7 I am reading as unicode and writing as utf-16-le. Most characters are written correctly, but some are not, for example u'\u810a', also known as unichr(33034). The following code does not write it correctly:

import codecs
with open('temp.txt','w') as temp:
    temp.write(codecs.BOM_UTF16_LE)     
    text = unichr(33034)  # text = u'\u810a'
    temp.write(text.encode('utf-16-le'))

However, either of the following changes makes the code work:

  1. Using unichr(33033) or unichr(33035) instead.

  2. Using 'utf-8' encoding (without a BOM, byte-order mark).

How can I recognize characters that won't write correctly, and how can I write a 'utf-16-le' encoded file with BOM that either prints these characters or some replacement?


Solution

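  • You are opening the file in text mode, which means that line-break bytes are translated to the local convention when written. Encoded as UTF-16-LE, u'\u810a' becomes the two bytes 0A 81. On Windows, for example, text mode treats the 0A byte as a line feed and writes it as 0D 0A, so the character does not make it to the file correctly. The neighbouring characters u'\u8109' and u'\u810b' encode to 09 81 and 0B 81, and the UTF-8 encoding of u'\u810a' is E8 84 8A, so none of those contain a 0A byte.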

    Open the file in binary mode instead:

    open('temp.txt','wb')
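As a rough sketch of both points, the snippet below checks which characters produce a 0A byte when encoded and then writes the problem character in binary mode. The helper name has_newline_byte and the scratch path are made up for illustration, and the u'...' literals are used in place of unichr() so the same code also runs on Python 3.

```python
import codecs
import os
import tempfile


def has_newline_byte(char, encoding='utf-16-le'):
    """Return True if the encoded form of `char` contains the byte 0x0A.

    Hypothetical helper, not part of any library.
    """
    return b'\x0a' in char.encode(encoding)


text = u'\u810a'  # the character from the question, unichr(33034) in Python 2

# Of the three neighbouring code points, only U+810A yields a 0x0A byte
# in UTF-16-LE (0A 81); the others encode to 09 81 and 0B 81.
flags = [has_newline_byte(c) for c in (u'\u8109', u'\u810a', u'\u810b')]

# UTF-8 encodes the same character as E8 84 8A, so no 0x0A byte appears.
utf8_flag = has_newline_byte(text, 'utf-8')

# Write to a scratch directory (illustrative path) in binary mode,
# so that no byte is rewritten as a line break.
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')
with open(path, 'wb') as temp:
    temp.write(codecs.BOM_UTF16_LE)
    temp.write(text.encode('utf-16-le'))

# Read the raw bytes back to confirm that BOM + 0A 81 arrived intact.
with open(path, 'rb') as temp:
    data = temp.read()
```

With binary mode there should be no need to detect or replace such characters at all; the check is only useful for confirming why a given character was affected.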