Tags: python-2.7, unicode, nltk, python-unicode

Getting Python to print in Cyrillic in a feature extractor


I'm trying to train a program to determine whether a newly given Russian word is a noun or a verb.

def POS_features(word):
    return {'three_last_letters': word[-3:]}

print(POS_features(u'Богатир'))

This prints {'three_last_letters': u'\u0442\u0438\u0440'}

Even though I pass in the Unicode literal u'Богатир', the last three letters print as escape sequences instead of Cyrillic. How can I get Python to print in Cyrillic?


Solution

  • Your function returns a dict, and the dict is what gets printed. Printing a container shows the repr of its contents, that is, a Python-like representation of each item, and in Python 2 the repr of a unicode string escapes non-ASCII characters as \uXXXX. If you print the values yourself, you get the text you expect.

    >>> def POS_features(word):
    ...     return {'three_last_letters': word[-3:]}
    ... 
    >>> val = POS_features(u'Богатир')
    >>> for k,v in val.items():
    ...     print k, v
    ... 
    three_last_letters тир
    

    I pasted your printed result back into my shell and got a dict again. It's not guaranteed that the string representation of an object can be turned back into that object, but it works for simple types.

    >>> val = {'three_last_letters': u'\u0442\u0438\u0440'}
    >>> val
    {'three_last_letters': u'\u0442\u0438\u0440'}
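
If you want the whole dict displayed readably without looping at the prompt, one possible approach is to build the output string from the values yourself, or to serialize the dict with json.dumps and ensure_ascii=False. This is only a sketch: the helper name format_features is made up for illustration, and the __future__ imports are an assumption so that the snippet behaves the same on Python 2.7 and Python 3. Whether the Cyrillic actually shows up still depends on your terminal's encoding.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import json


def POS_features(word):
    # Same feature extractor as in the question.
    return {'three_last_letters': word[-3:]}


def format_features(features):
    # Hypothetical helper (not part of NLTK): one "key: value" pair per line,
    # built from the values themselves rather than from their repr.
    return '\n'.join('{0}: {1}'.format(key, value)
                     for key, value in sorted(features.items()))


features = POS_features('Богатир')

# Formatting the values directly avoids the \uXXXX escapes of repr.
print(format_features(features))

# ensure_ascii=False tells json.dumps to keep non-ASCII characters as they are.
print(json.dumps(features, ensure_ascii=False))
```

The dict itself is unchanged by either call, so it can still be passed on to a classifier as-is; only the display differs.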