Writing unicode type objects to file in Python

I'm trying to write unicode strings to a file in Python, but when I view the file with the Linux "cat" or "less" commands, the correct characters are not shown. They show up as garbage instead.

I am reading the object from an Oracle database. When I print the type (where a is a row in the database results):

logger.debug(type(a[index]))

it outputs:

<type 'unicode'>

I open the file for writing like so:

ff = codecs.open(filename, mode='w', encoding='utf-8')

and I write the line to the file like:

ff.write(a[index])

but when I read the output file, it doesn't show the correctly accented characters but garbage instead:

$Buï¿½ï¿½rger, Udo, -1985. Way to perfect horsemanship

How do I correctly write unicode string objects to a file in Python?

Solution

I can guess at how you arrived at that Mojibake of a string. It is quite involved, and I am impressed by how mucked up it managed to get.

Something decoded text from bytes to Unicode with errors='replace'. That masked the fact that the wrong codec was used, as the bytes that weren't recognized were replaced with replacement characters.

The resulting Unicode text, with its U+FFFD REPLACEMENT CHARACTER codepoints, was then encoded to UTF-8, and those bytes were decoded again as Latin-1, most likely by your terminal, since cat and less output the raw bytes.

Reversing that last step (encoding as Latin-1, then decoding as UTF-8) shows the text that was actually written to the file:

>>> print u'$Buï¿½ï¿½rger, Udo, -1985. Way to perfect horsemanship'.encode('latin1').decode('utf8')
$Bu��rger, Udo, -1985. Way to perfect horsemanship

Presumably this was meant to be Bürger, Udo, - 1985. Way to perfect horsemanship, with the ü formed by the letter u followed by the U+0308 COMBINING DIAERESIS codepoint. That codepoint is encoded as the two bytes CC 88 in UTF-8, neither of which can be decoded as ASCII, so each one turns into a replacement character:

>>> text = u'Bu\u0308rger, Udo, - 1985. Way to perfect horsemanship'
>>> print text
Bürger, Udo, - 1985. Way to perfect horsemanship
>>> text.encode('utf8')
'Bu\xcc\x88rger, Udo, - 1985. Way to perfect horsemanship'
>>> text.encode('utf8').decode('ascii', errors='replace')
u'Bu\ufffd\ufffdrger, Udo, - 1985. Way to perfect horsemanship'
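The whole chain can be replayed end to end to confirm the guess. This is only a sketch: it assumes the original bytes were UTF-8 and that the faulty decode used ASCII (any codec that rejects the bytes CC 88 would give the same result), and the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Replay of the suspected corruption chain (assumed codecs: UTF-8, ASCII, Latin-1).

original = u'Bu\u0308rger'           # u + U+0308 COMBINING DIAERESIS

# Step 1: the text exists as UTF-8 bytes somewhere upstream.
raw = original.encode('utf-8')       # contains the bytes CC 88

# Step 2: something decodes those bytes with the wrong codec and
# errors='replace', so each undecodable byte becomes U+FFFD.
damaged = raw.decode('ascii', errors='replace')

# Step 3: the damaged text is written to the file as UTF-8.
written = damaged.encode('utf-8')    # each U+FFFD becomes EF BF BD

# Step 4: a terminal that assumes Latin-1 displays those bytes.
displayed = written.decode('latin-1')
```

After these steps, displayed holds the same ï¿½ï¿½ sequence seen in the question, and the original combining diaeresis is gone for good: nothing after step 2 can recover it.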

The moral of the story: Don't use errors='replace' unless you are absolutely sure what you are doing.
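For contrast, here is a minimal sketch of what the default strict error handling does with the same bytes, followed by a write using the matching codec. Where the bad decode actually happens in your setup is not known from the question, so the byte string, the temporary path and the variable names below are assumptions for illustration only.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical input: UTF-8 bytes as they might arrive from upstream.
raw = u'Bu\u0308rger'.encode('utf-8')

# With the default errors='strict', the wrong codec fails loudly
# instead of silently destroying data.
try:
    raw.decode('ascii')
    wrong_codec_detected = False
except UnicodeDecodeError:
    wrong_codec_detected = True

# Decoding with the codec the bytes were really encoded in is lossless.
text = raw.decode('utf-8')

# Writing the Unicode text with an explicit encoding, as in the question.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')
with io.open(path, 'w', encoding='utf-8') as ff:
    ff.write(text)

# Read the raw bytes back to check what landed on disk.
with io.open(path, 'rb') as f:
    on_disk = f.read()
```

The write step is the same as the codecs.open() call in the question, which was already correct. The damage happened earlier, at the point where the bytes were first decoded.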