python regex utf-8 cyrillic

Python: re.split() display cyrillic result


I am trying to write a function that simply splits a string on any character that is not a letter or a digit. The input contains Cyrillic, and for the non-Latin words the resulting list holds byte fragments such as '\xd0' instead of the words themselves.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

class Syntax():
    def __init__(self, string):
        self.string = string.encode('utf-8')
        self.list = None

    def split(self):
        self.list = re.split(ur"\W+", self.string, flags=re.U)

if __name__ == '__main__':  
    string = ur"Привет, мой друг test words."
    a = Syntax(string)
    a.split()
    print a.string, a.list

Console output:

Привет, мой друг test words.
['\xd0', '\xd1', '\xd0', '\xd0\xb2\xd0\xb5\xd1', '\xd0\xbc\xd0\xbe\xd0\xb9', '\xd0', '\xd1', '\xd1', '\xd0\xb3', 'test', 'words', ''] 

Thanks for your help.


Solution

  • There are two problems here:

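    1. Your Syntax constructor encodes the unicode value into a UTF-8 byte string (str in Python 2). re.split() then works on individual bytes rather than characters, so the two-byte Cyrillic letters get cut apart. In general you should leave text values as unicode (self.string = string, no encoding).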

    2. When you print a Python list, it calls repr on each element, so non-ASCII text is shown as escape sequences rather than as readable characters. If you do

      for x in a.list:
          print x
      

    after making the first change, it'll print Cyrillic.
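
Both changes together can be sketched as follows. This is a minimal sketch, assuming it should run unchanged on Python 2.7 and Python 3: the text stays unicode from start to finish, and the words are printed one at a time. The helper name split_words and the filtering of empty pieces are illustrative additions, not part of the original class.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import re


def split_words(text):
    """Split unicode text on runs of non-word characters.

    Hypothetical helper for illustration. Empty pieces, such as the
    one produced by the trailing '.', are dropped.
    """
    pieces = re.split(r"\W+", text, flags=re.UNICODE)
    return [piece for piece in pieces if piece]


# The text is never encoded to bytes before it reaches re.split().
words = split_words("Привет, мой друг test words.")

# Print each element on its own instead of printing the list.
for word in words:
    print(word)
```

Without the filter, the trailing period would leave an empty string at the end of the list, as in the console output from the question.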

    Edit: printing a list calls repr on the elements, not str. Printing a string directly does not repr it, so print x and print repr(x) produce different output. For strings, the repr is always something you can evaluate in Python to recover the value.
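
The difference between printing a string and printing its repr can be checked directly. This is a small sketch, again assuming code that runs on both Python 2.7 and Python 3; ast.literal_eval stands in for "evaluating" the repr, and the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import ast

word = "друг"

# repr() is what a list shows for each of its elements.
shown = repr(word)

# Evaluating the repr as a Python literal gives the original value back.
recovered = ast.literal_eval(shown)

print(word)    # the text itself
print(shown)   # the repr: quoted, and escaped on Python 2
print([word])  # a list prints the repr of each element
```

On Python 2 the repr of a unicode value is ASCII-only, which is why a printed list shows escapes there. Python 3 keeps printable non-ASCII characters in the repr, so only the quotes are added.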