I'm trying to write a function that simply splits a string on any character that is not a letter or a digit. But I need to handle Cyrillic, and when I do, I get an output list with elements like '\xd0' instead of the non-Latin words.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

class Syntax():
    def __init__(self, string):
        self.string = string.encode('utf-8')
        self.list = None

    def split(self):
        self.list = re.split(ur"\W+", self.string, flags=re.U)

if __name__ == '__main__':
    string = ur"Привет, мой друг test words."
    a = Syntax(string)
    a.split()
    print a.string, a.list
Console output:
Привет, мой друг test words.
['\xd0', '\xd1', '\xd0', '\xd0\xb2\xd0\xb5\xd1', '\xd0\xbc\xd0\xbe\xd0\xb9', '\xd0', '\xd1', '\xd1', '\xd0\xb3', 'test', 'words', '']
Thanks for your help.
There are two problems here:
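You're encoding the unicode value to a byte string (str) in your Syntax constructor, so re.split operates on individual UTF-8 bytes rather than on characters and cuts the Cyrillic words apart. In general you should leave text values as unicode (self.string = string, with no encoding).
When you print a Python list, it calls repr on each element, so non-ASCII characters are shown as escape sequences rather than as readable text. If you do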
for x in a.list:
    print x
after making the first change, it'll print Cyrillic.
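Putting both changes together, a minimal sketch might look like the following. The helper name split_words is hypothetical, and the __future__ imports are an assumption made so that the same code runs on Python 2 and Python 3; the question's own script is Python 2 only.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import re


def split_words(text):
    """Split text on runs of non-word characters (hypothetical helper)."""
    # text must be a unicode string, not UTF-8 encoded bytes, so that \W
    # is matched against whole characters rather than individual bytes.
    parts = re.split(r"\W+", text, flags=re.UNICODE)
    # re.split leaves an empty string when the text starts or ends
    # with a separator, so drop the empty pieces.
    return [part for part in parts if part]


words = split_words("Привет, мой друг test words.")

# Print each element on its own instead of printing the list,
# so the text itself is shown rather than its repr.
for word in words:
    print(word)
```

Note that \w also matches the underscore, so \W+ will not split on it. If underscores should count as separators too, a pattern such as [\W_]+ is one option.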
Edit: printing a list calls repr on the elements, not str. Printing a string directly, however, doesn't repr it: print x and print repr(x) produce different output. For strings, the repr is always something you can evaluate in Python to recover the value.
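The difference between repr and a plain print can be checked with a short sketch. It uses the encoded bytes of one word, on the assumption that the code should behave the same on Python 2 and Python 3 (Python 3 no longer escapes Cyrillic in the repr of text, but still escapes it in the repr of bytes). The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import ast

word = "мой"
encoded = word.encode("utf-8")  # the raw UTF-8 bytes of the word

# A list display uses repr() for each element, so the bytes appear as escapes.
listing = repr([encoded])
print(listing)

# The repr is a valid Python literal: evaluating it recovers the same bytes.
recovered = ast.literal_eval(repr(encoded))
print(recovered == encoded)

# Decoding the bytes and printing the text directly shows the Cyrillic word.
print(encoded.decode("utf-8"))
```

ast.literal_eval is used here instead of eval because it only accepts literals, which is enough to show that the repr round-trips.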