python python-2.7 unicode unicode-string python-unicode

Why can't I swap Unicode characters in Python?

Why can't I swap Unicode characters in this code?

# -*- coding: utf-8 -*-

character_swap = {'ą': 'a', 'ż': 'z', 'ó': 'o'}

text = 'idzie wąż wąską dróżką'

print text

print ''.join(character_swap.get(ch, ch) for ch in text)

OUTPUT: idzie wąż wąską dróżką

EXPECTED OUTPUT: idzie waz waska drozka

Solution

You need to decode your text to unicode first, then encode each character back to UTF-8 so that it matches the byte-string keys of the dictionary:

>>> print ''.join(character_swap.get(ch.encode('utf8'), ch) for ch in text.decode('utf8'))
idzie waz waska drozka

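This happens because in Python 2 a plain '...' literal is a byte string (str), not unicode. With a UTF-8 source encoding, each of these Polish letters is stored as two bytes, so iterating over text yields single bytes, and none of them matches the two-byte keys in character_swap. What you are actually iterating over here is: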

>>> [i for i in text]
['i', 'd', 'z', 'i', 'e', ' ', 'w', '\xc4', '\x85', '\xc5', '\xbc', ' ', 'w', '\xc4', '\x85', 's', 'k', '\xc4', '\x85', ' ', 'd', 'r', '\xc3', '\xb3', '\xc5', '\xbc', 'k', '\xc4', '\x85']

A character like ą is stored as two bytes:

>>> 'ą'
'\xc4\x85'
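
The same two bytes can be reproduced by encoding the character explicitly. This is only a small sketch; the names ch and encoded are chosen for illustration, and the u'...' literal is used so the snippet behaves the same on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
# Illustrative names (ch, encoded); not part of the original question.
ch = u'\u0105'                # the single character ą as a unicode string
encoded = ch.encode('utf-8')  # its UTF-8 representation as bytes

print(len(ch))                   # 1: one character
print(len(encoded))              # 2: two bytes
print(list(bytearray(encoded)))  # [196, 133], i.e. 0xc4 and 0x85
```

So one character in the unicode string corresponds to two separate items when the byte string is iterated.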

As the output of [i for i in text] shows, iterating over the byte string splits ą into two parts, \xc4 and \x85. To avoid that, decode your text from UTF-8 first, so that you iterate over whole characters:

>>> [i for i in text.decode('utf8')]
[u'i', u'd', u'z', u'i', u'e', u' ', u'w', u'\u0105', u'\u017c', u' ', u'w', u'\u0105', u's', u'k', u'\u0105', u' ', u'd', u'r', u'\xf3', u'\u017c', u'k', u'\u0105']
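
Since decoding gives whole characters, another option is to keep everything as unicode from the start, so no per-character encode is needed. The following is a sketch under that assumption, not the code from the question: the helper swap_characters and the names table, swapped and translated are hypothetical, and the u'...' literals are used so it runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: keys, values and text are all unicode strings (u'...').
character_swap = {u'ą': u'a', u'ż': u'z', u'ó': u'o'}
text = u'idzie wąż wąską dróżką'


def swap_characters(s, mapping):
    """Hypothetical helper: replace each character found in mapping, keep the rest."""
    return u''.join(mapping.get(ch, ch) for ch in s)


swapped = swap_characters(text, character_swap)

# The same result via translate(), which expects code points (integers) as keys.
table = {ord(k): v for k, v in character_swap.items()}
translated = text.translate(table)

print(swapped)     # idzie waz waska drozka
print(translated)  # idzie waz waska drozka
```

With unicode on both sides, the dictionary lookup compares characters with characters, which is the comparison the original loop was meant to perform.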