Why can't I swap Unicode characters in my code?
# -*- coding: utf-8 -*-
character_swap = {'ą': 'a', 'ż': 'z', 'ó': 'o'}
text = 'idzie wąż wąską dróżką'
print text
print ''.join(character_swap.get(ch, ch) for ch in text)
OUTPUT: idzie wąż wąską dróżką
EXPECTED OUTPUT: idzie waz waska drozka
You need to decode your text first, then encode each character again before looking it up in the dictionary:
>>> print ''.join(character_swap.get(ch.encode('utf8'), ch) for ch in text.decode('utf8'))
idzie waz waska drozka
This happens because in Python 2 a plain string literal is a byte string, so iterating over it yields individual bytes rather than characters. What you are actually doing here is:
>>> [i for i in text]
['i', 'd', 'z', 'i', 'e', ' ', 'w', '\xc4', '\x85', '\xc5', '\xbc', ' ', 'w', '\xc4', '\x85', 's', 'k', '\xc4', '\x85', ' ', 'd', 'r', '\xc3', '\xb3', '\xc5', '\xbc', 'k', '\xc4', '\x85']
And for a character like ą we have:
>>> 'ą'
'\xc4\x85'
As you can see, ą is stored as two bytes, \xc4 and \x85, and iterating over the string splits it into those two parts, neither of which is a key in character_swap. To get rid of that, you can first decode your text from UTF-8:
>>> [i for i in text.decode('utf8')]
[u'i', u'd', u'z', u'i', u'e', u' ', u'w', u'\u0105', u'\u017c', u' ', u'w', u'\u0105', u's', u'k', u'\u0105', u' ', u'd', u'r', u'\xf3', u'\u017c', u'k', u'\u0105']
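An alternative that avoids the decode/encode round trip is to keep both the text and the dictionary keys as unicode from the start. The following is only a minimal sketch, assuming you are free to switch the module to unicode literals; the helper name swap_characters is hypothetical and not part of the original code. In Python 2, printing unicode may still depend on your terminal's encoding.

```python
# -*- coding: utf-8 -*-
# Sketch only: with unicode_literals, every plain string literal in this
# module is unicode, so iteration yields whole characters, not bytes.
from __future__ import print_function, unicode_literals

character_swap = {'ą': 'a', 'ż': 'z', 'ó': 'o'}
text = 'idzie wąż wąską dróżką'


def swap_characters(s, mapping):
    # Hypothetical helper: replace each character found in mapping,
    # leaving all other characters unchanged.
    return ''.join(mapping.get(ch, ch) for ch in s)


print(swap_characters(text, character_swap))
```

Because the dictionary keys and the iterated characters are now the same type, the lookup matches without any per-character encode call.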