Tags: python, unicode, string-conversion, mojibake

unprintable python unicode string


I retrieved some EXIF info from an image and got the following:

{ ...
37510: u'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'
...}

I expected it to be

{ ...
37510: u'D2\nArbeitsamt\nÄnderungsbescheid'
... }

I need to convert the value to a str, but I couldn't get it to work. Using Python 2.7, I always get something like

UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-15: ordinal not in range(128)

Any ideas how I can handle this?

UPDATE:

I tried it with Python 3 and no error is thrown now, but the result is

{ ...
37510: 'D2\nArbeitsamt\nÃ\x84nderungsbescheid',
... }

which is still not what I expected.


Solution

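  • This looks like UTF-8 that was incorrectly decoded as Latin-1 and then stored in a unicode string. Latin-1 maps each code point from 0 to 255 to the byte with the same value, so .encode('iso8859-1') reverses the incorrect decoding and gives back the original UTF-8 bytes.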

    >>> my_dictionary = {37510: u'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'}
    >>> print(my_dictionary[37510].encode('iso8859-1'))
    D2
    Arbeitsamt
    Änderungsbescheid
    

    The byte string prints fine now (on a UTF-8 terminal), but you may also want to decode it as UTF-8, so that it ends up as a unicode object of the correct type for further processing:

    >>> type(my_dictionary[37510].encode('iso8859-1'))
    <type 'str'>
    >>> print(my_dictionary[37510].encode('iso8859-1').decode('utf8'))
    D2
    Arbeitsamt
    Änderungsbescheid
    >>> type(my_dictionary[37510].encode('iso8859-1').decode('utf8'))
    <type 'unicode'>
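
For the Python 3 case mentioned in the update, the same round trip should apply, since the value is a str holding the mis-decoded characters. Below is a minimal sketch, assuming the whole value was mis-decoded in the same way. The helper name fix_mojibake is hypothetical, not part of any library, and it falls back to the unchanged input when the round trip fails.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1.

    Hypothetical helper for illustration. Returns the input unchanged
    if it cannot be round-tripped (for example, if it was already
    decoded correctly).
    """
    try:
        # Latin-1 maps code points 0-255 straight back to the same byte
        # values, which recovers the original UTF-8 byte sequence.
        raw = text.encode('iso8859-1')
        # Decode those bytes with the encoding they were written in.
        return raw.decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character was outside Latin-1, or the bytes are not
        # valid UTF-8: assume the text was not mojibake of this kind.
        return text


# The value from the question, as Python 3 reports it.
exif = {37510: 'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'}
fixed = fix_mojibake(exif[37510])
```

Here fixed is a str whose last line reads Änderungsbescheid. The try/except is a design choice: without it, a value that was already correct, such as one containing a properly decoded Ä, would raise UnicodeDecodeError instead of passing through.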