I'm trying to write Unicode strings to a file in Python, but when I view the file with the Linux "cat" or "less" commands, the characters show up as garbage instead of the correct ones.
I am reading the object from an Oracle database. When I print the type (where a is a row in the database results):
logger.debug(type(a[index]))
it outputs:
<type 'unicode'>
I open the file for writing like so:
ff = codecs.open(filename, mode='w', encoding='utf-8')
and I write the line to the file like:
ff.write(a[index])
but when I read the output file, it doesn't show the correctly accented characters but garbage instead:
$Buï¿½ï¿½rger, Udo, -1985. Way to perfect horsemanship
How do I correctly write unicode string objects to a file in Python?
I can guess at how you arrived at that mojibake of a string. It is quite involved; I am impressed by how mucked up it got.
Something decoded text from bytes to Unicode with errors='replace', masking the fact that the wrong codec was used, as bytes that weren't recognized were replaced with replacement characters.
The resulting Unicode text, with its U+FFFD REPLACEMENT CHARACTER codepoints, was then encoded to UTF-8 but decoded again as Latin-1, most likely by your terminal, as cat or less output the raw bytes.
Reversing that last step shows the text that was actually written to the file:
>>> print u'$Buï¿½ï¿½rger, Udo, -1985. Way to perfect horsemanship'.encode('latin1').decode('utf8')
$Bu��rger, Udo, -1985. Way to perfect horsemanship
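The same round trip can be wrapped in a small helper. This is only a sketch: the name undo_latin1_mojibake is made up for illustration, and it assumes the damage is exactly "UTF-8 bytes displayed as Latin-1". The garbled sequence ï¿½ï¿½ is written with \x escapes to keep the source ASCII-only. Note that it can only recover the U+FFFD replacement characters here; the original letters were already lost in an earlier step.

```python
def undo_latin1_mojibake(text):
    """Reinterpret text that was UTF-8 but got decoded as Latin-1.

    Hypothetical helper: returns the input unchanged when the round
    trip is not possible, i.e. the text was not this kind of mojibake.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# U+00EF U+00BF U+00BD is what EF BF BD looks like when read as Latin-1.
garbled = u'$Bu\xef\xbf\xbd\xef\xbf\xbdrger'
repaired = undo_latin1_mojibake(garbled)
print(repr(repaired))
```

Returning the input unchanged on failure is a design choice for the sketch; raising instead may be preferable if silent pass-through would hide a problem.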
Presumably this was meant to be Bürger, Udo, - 1985. Way to perfect horsemanship, with the ü formed by the character u followed by the U+0308 COMBINING DIAERESIS codepoint. That codepoint is CC 88 in UTF-8, and those two bytes are not decodable as ASCII:
>>> text = u'Bu\u0308rger, Udo, - 1985. Way to perfect horsemanship'
>>> print text
Bürger, Udo, - 1985. Way to perfect horsemanship
>>> text.encode('utf8')
'Bu\xcc\x88rger, Udo, - 1985. Way to perfect horsemanship'
>>> text.encode('utf8').decode('ascii', errors='replace')
u'Bu\ufffd\ufffdrger, Udo, - 1985. Way to perfect horsemanship'
The moral of the story: Don't use errors='replace' unless you are absolutely sure what you are doing.
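For contrast, here is a minimal sketch of the strict alternative: with the default errors='strict', decoding with the wrong codec fails loudly, and a correctly decoded string written through a UTF-8 file object ends up as the expected bytes. The use of io.open and a temporary file is a choice made for this example (codecs.open, as in the question, behaves similarly for this purpose), and the variable names are illustrative.

```python
import io
import os
import tempfile

raw = u'Bu\u0308rger'.encode('utf-8')   # bytes as they might arrive

# With strict error handling the wrong codec raises instead of
# quietly inserting replacement characters.
try:
    raw.decode('ascii')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

text = raw.decode('utf-8')              # decode with the right codec

# Write the Unicode text as UTF-8, then read the raw bytes back.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with io.open(path, 'w', encoding='utf-8') as ff:
        ff.write(text)
    with io.open(path, 'rb') as ff:
        written = ff.read()
finally:
    os.remove(path)

print(strict_failed)
print(repr(written))
```

If the bytes on disk are right but cat or less still show garbage, the terminal's encoding is the next thing to check.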