python unicode string-conversion mojibake

unprintable python unicode string

I retrieved some EXIF info from an image and got the following:

{ ...
37510: u'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'
...}

I expected it to be

{ ...
37510: u'D2\nArbeitsamt\nÄnderungsbescheid'
... }

I need to convert the value to a str, but I couldn't get it to work. Using Python 2.7, I always get something like

UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-15: ordinal not in range(128)

Any ideas how I can handle this?

UPDATE:

I tried it with Python 3 and no error is thrown now, but the result is

{ ...
37510: 'D2\nArbeitsamt\nÃ\x84nderungsbescheid',
... }

which is still not what I expected.

Solution

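It looks like UTF-8 data that was incorrectly decoded as Latin-1 and then placed in a unicode string: the two characters \xc3\x84 are exactly the two UTF-8 bytes of Ä. You can use .encode('iso8859-1') to reverse the incorrect decoding, because Latin-1 maps every code point below 256 back to the byte with the same value.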

>>> my_dictionary = {37510: u'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'}
>>> print(my_dictionary[37510].encode('iso8859-1'))
D2
Arbeitsamt
Änderungsbescheid

You can print it just fine now (on a UTF-8 terminal), but you will probably also want to decode the resulting bytes as UTF-8, so the value ends up with the correct type for further processing:

>>> type(my_dictionary[37510].encode('iso8859-1'))
<type 'str'>
>>> print(my_dictionary[37510].encode('iso8859-1').decode('utf8'))
D2
Arbeitsamt
Änderungsbescheid
>>> type(my_dictionary[37510].encode('iso8859-1').decode('utf8'))
<type 'unicode'>
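
Since the update mentions Python 3, here is a minimal sketch of the same round trip there. It assumes the value really is UTF-8 that was misread as Latin-1. The helper name fix_mojibake is made up for illustration and is not part of any library. If the string cannot be round-tripped (for example because it was already correct), the sketch returns it unchanged.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Hypothetical helper: returns the input unchanged when the
    round trip is not possible.
    """
    try:
        # Latin-1 maps code points 0-255 back to the identical bytes,
        # which recovers the original UTF-8 byte sequence.
        raw = text.encode('iso8859-1')
        # Decode those bytes with the encoding they were written in.
        return raw.decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters outside Latin-1, or the
        # recovered bytes are not valid UTF-8: leave the text alone.
        return text


# Reproduce the problem: UTF-8 bytes misread as Latin-1.
broken = 'Änderungsbescheid'.encode('utf-8').decode('iso8859-1')

my_dictionary = {37510: 'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'}
fixed = fix_mojibake(my_dictionary[37510])
print(fixed)
```

In Python 3 the intermediate value returned by .encode('iso8859-1') is a bytes object and the final result is a str, so no further conversion is needed. Catching both exceptions is a design choice for the sketch; in real code you may prefer to let the error propagate so that unexpected input is noticed.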