python python-2.7 unicode unicode-string python-unicode

Why can't I swap Unicode characters in Python?


Why can't I swap Unicode characters in this code?

# -*- coding: utf-8 -*-

character_swap = {'ą': 'a', 'ż': 'z', 'ó': 'o'}

text = 'idzie wąż wąską dróżką'

print text

print ''.join(character_swap.get(ch, ch) for ch in text)

OUTPUT: idzie wąż wąską dróżką

EXPECTED OUTPUT: idzie waz waska drozka


Solution

  • You need to decode your text to Unicode first, then encode each character back to UTF-8 for the dictionary lookup:

    >>> print ''.join(character_swap.get(ch.encode('utf8'), ch) for ch in text.decode('utf8'))
    idzie waz waska drozka
    

    This is because in Python 2 a plain string literal is a byte string, so iterating over it yields individual bytes rather than characters. What you are actually iterating over here is:

    >>> [i for i in text]
    ['i', 'd', 'z', 'i', 'e', ' ', 'w', '\xc4', '\x85', '\xc5', '\xbc', ' ', 'w', '\xc4', '\x85', 's', 'k', '\xc4', '\x85', ' ', 'd', 'r', '\xc3', '\xb3', '\xc5', '\xbc', 'k', '\xc4', '\x85']
    

    A character like ą is stored as two bytes in UTF-8:

    >>> 'ą'
    '\xc4\x85'
    

    As you can see, iterating over the byte string splits ą into two parts, \xc4 and \x85, and neither of them matches a key in the dictionary. To get rid of that, first decode your text from UTF-8:

    >>> [i for i in text.decode('utf8')]
    [u'i', u'd', u'z', u'i', u'e', u' ', u'w', u'\u0105', u'\u017c', u' ', u'w', u'\u0105', u's', u'k', u'\u0105', u' ', u'd', u'r', u'\xf3', u'\u017c', u'k', u'\u0105']
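
A possible alternative to the decode/encode round trip is to keep everything as Unicode from the start. The sketch below assumes you can change the literals in the question to `u'...'` strings. The names `swapped` and `translated` are only illustrative. It also shows `translate()`, which accepts a dict keyed by code point, as a second option.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Unicode literals: every key, and every element of `text`, is one code point.
character_swap = {u'ą': u'a', u'ż': u'z', u'ó': u'o'}
text = u'idzie wąż wąską dróżką'

# Same join as in the question, now iterating over characters instead of bytes.
swapped = u''.join(character_swap.get(ch, ch) for ch in text)

# Alternative: translate() takes a mapping from code point (ord) to replacement.
translated = text.translate({ord(k): v for k, v in character_swap.items()})

print(swapped)     # idzie waz waska drozka
print(translated)  # idzie waz waska drozka
```

With Unicode on both sides, the lookup key and the dictionary key have the same type, so no per-character `encode` call is needed.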