I have a unicode string:
u'1\u670d-\u82f1\u96c4\u96c6\u7ed3'
And I convert it to a UTF-8 encoded byte string:
s = str(u'1\u670d-\u82f1\u96c4\u96c6\u7ed3'.encode('utf8'))
print repr(s)
'1\xe6\x9c\x8d-\xe8\x8b\xb1\xe9\x9b\x84\xe9\x9b\x86\xe7\xbb\x93'
I want to separate the number 1 from the rest of the string.
Then I tried:
s.split('\\')
['1\xe6\x9c\x8d-\xe8\x8b\xb1\xe9\x9b\x84\xe9\x9b\x86\xe7\xbb\x93']
s.split('\\x')
['1\xe6\x9c\x8d-\xe8\x8b\xb1\xe9\x9b\x84\xe9\x9b\x86\xe7\xbb\x93']
Neither gave the result I expected.
Finally an idea came to mind, and I tried:
s.split('\xe6')
['1', '\x9c\x8d-\xe8\x8b\xb1\xe9\x9b\x84\xe9\x9b\x86\xe7\xbb\x93']
The problem is that I can't guarantee the UTF-8 bytes in other such strings will start with '\xe6', so I need a way to tell the number apart from arbitrary UTF-8 bytes and then split the two.
Is it possible to do that?
If it’s always a single digit, just index the first item:
digit = s[0]
Otherwise, you could use a regular expression to scan it:
import re
number = re.match(r'^\d+', s).group(0)
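If you also need the part after the number, here is a minimal sketch that captures both pieces in one match. The helper name split_leading_number is made up for illustration, and it assumes the number is written with ASCII digits at the very start of the string. It works on the unicode string before encoding, which avoids cutting a multi-byte character in half:

```python
import re

def split_leading_number(text):
    """Return (digits, rest); digits is empty if text has no leading ASCII digits."""
    # [0-9]* grabs the leading digits (possibly none); (.*) takes everything else.
    # re.DOTALL lets '.' match newlines too, so nothing in the remainder is lost.
    match = re.match(r'([0-9]*)(.*)', text, re.DOTALL)
    return match.group(1), match.group(2)

number, rest = split_leading_number(u'1\u670d-\u82f1\u96c4\u96c6\u7ed3')
```

Unlike the .group(0) one-liner, this does not raise when the string has no leading digits, because the pattern can match an empty number. Call rest.encode('utf8') afterwards if you need bytes.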