I want to read some words from an Excel file and extract some information. Reading the file is no problem.
The problem is that I want to increment the last character of a word. That is no problem for normal characters like 'A', but special characters like 'Í' are a problem.
I read the content with this:
val = val.encode('utf-8')
I put this value in a dictionary.
The next step is to iterate through the dict and get the saved information. info['streettype'] contains my val from before. Now I convert the value to upper case like this:
w2 = info['streettype'].decode('utf-8').upper().encode('utf-8')
That is needed because some characters are special, like I said (e.g. 'é', 'ž', 'í'). Now I want to increment the last character of the word, which can be a special character.
w3 = w2.decode('utf-8')[:-1].encode('utf-8')
lastLetter = w2.decode('utf-8')[-1].encode('utf-8')
Now I increment the character by using:
lastLetter2 = (chr(ord(lastLetter.decode('utf-8')) + 1))
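Put together, the steps look roughly like this (Python 3 syntax for illustration; my actual file-reading code is omitted):

```python
val = u'nábřeží'.encode('utf-8')             # what I store in the dict
info = {'streettype': val}

w2 = info['streettype'].decode('utf-8').upper().encode('utf-8')
w3 = w2.decode('utf-8')[:-1].encode('utf-8')
lastLetter = w2.decode('utf-8')[-1]
lastLetter2 = chr(ord(lastLetter) + 1)       # a text string now, not bytes

# w3 is bytes but lastLetter2 is text -- mixing the two when writing
# the output file is where things start to go wrong.
```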
Next I want to save it in a text file. I want to save the original word and the edited word. I think I need to re-encode my lastLetter2, but it does not work. When I just save my w2 and w3+lastLetter2, I get strange results because some parts are encoded and some are not.
For the word:
NÁBŘEŽÍ
my Result is:
"NÃBŘEŽÃ", "NÃBŘEŽÎÃ"
but I want:
"NÁBŘEŽÍ", "NÁBŘEŽÎ"
(Í is code point 205, Î is code point 206)
Can someone help me solve this problem?
Stop encoding your data to UTF-8 all the time; keep your text as Unicode, it makes processing much easier. Leave encoding to the last minute, preferably by having the file object encode this for you.
Having the file object encode Unicode for you means that in Python 2 you'd use io.open() rather than the standard built-in open() function; this is the same infrastructure Python 3 uses to handle Unicode and file I/O.
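For example (a sketch; io.open is the same as the built-in open in Python 3, and the file path here is made up):

```python
import io
import os
import tempfile

# Let the file object do the encoding: open with an explicit codec and
# read/write Unicode strings directly.
path = os.path.join(tempfile.mkdtemp(), 'words.txt')

with io.open(path, 'w', encoding='utf-8') as outfile:
    outfile.write(u'NÁBŘEŽÍ\n')

with io.open(path, 'r', encoding='utf-8') as infile:
    word = infile.read().strip()

# word is a Unicode string again; no manual .encode()/.decode() calls needed.
```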
You managed to create a Mojibake by encoding and decoding at will here; your text is now a mix of UTF-8 data that was decoded with Windows codepage 1252 and then encoded to UTF-8 again, plus non-mangled data:
>>> print u'NÃBŘEŽÃ'
NÃBŘEŽÃ
>>> print u'NÃBŘEŽÃ'[3:-1].encode('cp1252').decode('utf8')
ŘEŽ
Note that the last character in the first string is invalid; it is missing a byte! That's because 'decoding' the last character's UTF-8 bytes is not even possible with a proper CP1252 codec (the byte 0x8D is undefined in CP1252); I had to use the ftfy
project's internal repair codecs to bypass that problem:
>>> print u'NÃBŘEŽÃ\x8d'[3:].encode('sloppy-cp1252').decode('utf8')
ŘEŽÍ
>>> u'Í'.encode('utf8').decode('cp1252')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 1: character maps to <undefined>
>>> u'Í'.encode('utf8').decode('sloppy-cp1252')
u'\xc3\x8d'
>>> print u'Í'.encode('utf8').decode('sloppy-cp1252')
Ã
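If you don't have ftfy installed, the same mechanism can be illustrated with latin-1, which (unlike CP1252) maps every byte value, so both the mangling and its reversal work with the standard codecs alone:

```python
original = u'Í'

# Mangle: encode to UTF-8, then mistakenly decode with a single-byte codec.
mangled = original.encode('utf-8').decode('latin-1')
print(repr(mangled))                # u'\xc3\x8d', i.e. 'Ã' plus a stray byte

# Undo the mistake by reversing the exact same steps.
repaired = mangled.encode('latin-1').decode('utf-8')
print(repaired == original)         # True
```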
The only way to fix this is to a) ensure you read your data using the correct codecs, and b) treat all text as Unicode throughout your code, encoding only at the last moment to the correct output codec.
Handling Unicode code points with ord() and unichr() (in Python 2) or chr() (in Python 3) will then work as expected:
>>> lastletter = u'Î'
>>> ord(lastletter)
206
>>> unichr(ord(lastletter) + 1)
u'\xcf'
>>> print unichr(ord(lastletter) + 1)
Ï
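Putting it all together, a minimal sketch of the whole fix (Python 3 syntax; under Python 2 substitute unichr() for chr(), and the file path is just an example):

```python
import io
import os
import tempfile

def increment_last(word):
    # Operate on Unicode code points; use unichr() instead of chr() in Python 2.
    return word[:-1] + chr(ord(word[-1]) + 1)

original = u'nábřeží'.upper()       # .upper() handles accented letters on Unicode text
edited = increment_last(original)

path = os.path.join(tempfile.mkdtemp(), 'result.txt')
with io.open(path, 'w', encoding='utf-8') as outfile:
    # Encoding happens here, at the last moment, inside the file object.
    outfile.write(u'"%s", "%s"\n' % (original, edited))
```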
You may want to read up on Python and Unicode.