Writing unicode type objects to file in Python


I'm trying to write unicode strings to a file in Python but when I read the file using linux "cat" or "less" the correct characters are not written, instead they show up as garbage.

I am reading the object from an Oracle database. When I print the type (where a is a row in the database results):

logger.debug(type(a[index])) 

it outputs:

<type 'unicode'>

I open the file for writing like so:

ff = codecs.open(filename, mode='w', encoding='utf-8')

and I write the line to the file like:

ff.write(a[index])

but when I read the output file, it doesn't show the correctly accented characters but garbage instead:

$Buï¿½ï¿½rger, Udo, -1985. Way to perfect horsemanship

How do I correctly write unicode string objects to a file in Python?


Solution

  • I can guess at how you arrived at that Mojibake of a string. It is quite involved; I am impressed by how mucked up this got.

    Something decoded text from bytes to Unicode with errors='replace', which masked the fact that the wrong codec was used: bytes that weren't recognized were replaced with replacement characters.

    The resulting Unicode text, with its U+FFFD REPLACEMENT CHARACTER codepoints, was then encoded to UTF-8 and decoded again as Latin-1, most likely by your terminal, since cat and less output the raw bytes.

    Undoing that last step (encoding back to Latin-1, then decoding as UTF-8) recovers the replacement characters:

    >>> print u'$Buï¿½ï¿½rger, Udo, -1985. Way to perfect horsemanship'.encode('latin1').decode('utf8')
    $Bu��rger, Udo, -1985. Way to perfect horsemanship
    

    Presumably this was meant to be Bürger, Udo, - 1985. Way to perfect horsemanship, with the ü formed by the character u followed by the U+0308 COMBINING DIAERESIS codepoint. That codepoint is encoded as the two bytes CC 88 in UTF-8, neither of which can be decoded as ASCII:

    >>> text = u'Bu\u0308rger, Udo, - 1985. Way to perfect horsemanship'
    >>> print text
    Bürger, Udo, - 1985. Way to perfect horsemanship
    >>> text.encode('utf8')
    'Bu\xcc\x88rger, Udo, - 1985. Way to perfect horsemanship'
    >>> text.encode('utf8').decode('ascii', errors='replace')
    u'Bu\ufffd\ufffdrger, Udo, - 1985. Way to perfect horsemanship'
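
The whole chain described above can be replayed end to end. This is only a sketch of the suspected sequence, not a record of what the Oracle driver or the terminal actually did; the variable names are made up for illustration, and the code is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical replay of the suspected corruption chain (names are illustrative).

original = u'Bu\u0308rger'              # 'u' followed by U+0308 COMBINING DIAERESIS
utf8_bytes = original.encode('utf-8')   # the diaeresis becomes the bytes CC 88

# Step 1: the wrong codec plus errors='replace' hides the decoding failure;
# each undecodable byte turns into one U+FFFD REPLACEMENT CHARACTER.
damaged = utf8_bytes.decode('ascii', errors='replace')

# Step 2: the damaged text is written out as UTF-8;
# each U+FFFD becomes the three bytes EF BF BD.
written = damaged.encode('utf-8')

# Step 3: a viewer that assumes Latin-1 shows every byte as its own character.
shown = written.decode('latin-1')

print(repr(damaged))
print(repr(shown))
```

After step 1 the original bytes CC 88 are gone for good, so nothing done at write time can bring the ü back.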
    

    The moral of the story: Don't use errors='replace' unless you are absolutely sure what you are doing.
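
If that diagnosis is right, the writing code in the question is not the culprit, and the fix belongs wherever the bytes are first decoded. A minimal sketch of the safer pattern, assuming the source bytes really are UTF-8: decode strictly (the default) so a wrong codec fails loudly, then write through a file object that encodes for you. The temporary path and the variable names are hypothetical, and io.open is used here as an alternative to codecs.open that also works on Python 3.

```python
import io
import os
import tempfile

text = u'Bu\u0308rger, Udo'
raw = text.encode('utf-8')

# Strict decoding raises on the wrong codec instead of inserting U+FFFD.
try:
    raw.decode('ascii')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Writing text: the file object encodes to UTF-8 on write.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')  # hypothetical output file
with io.open(path, 'w', encoding='utf-8') as fh:
    fh.write(text)

# Read the raw bytes back to check that CC 88 survived.
with io.open(path, 'rb') as fh:
    on_disk = fh.read()

print(strict_failed)
print(repr(on_disk))
```

A file written this way still needs a UTF-8 terminal locale for cat or less to render the accent correctly.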