python, string, utf-8, split, mojibake

How to split a UTF-8 string with a leading number in Python?


I have a unicode string:

u'1\u670d-\u82f1\u96c4\u96c6\u7ed3'

And I convert it to a UTF-8 encoded byte string:

s = str(u'1\u670d-\u82f1\u96c4\u96c6\u7ed3'.encode('utf8'))
print repr(s)
'1\xe6\x9c\x8d-\xe8\x8b\xb1\xe9\x9b\x84\xe9\x9b\x86\xe7\xbb\x93'

I want to separate the number 1 from the rest of the string.

Then I tried:

s.split('\\')
['1\xe6\x9c\x8d-\xe8\x8b\xb1\xe9\x9b\x84\xe9\x9b\x86\xe7\xbb\x93']

s.split('\\x')
['1\xe6\x9c\x8d-\xe8\x8b\xb1\xe9\x9b\x84\xe9\x9b\x86\xe7\xbb\x93']

That is not what I expected.

Finally an idea came to mind, and I tried:

s.split('\xe6')
['1', '\x9c\x8d-\xe8\x8b\xb1\xe9\x9b\x84\xe9\x9b\x86\xe7\xbb\x93']

But the problem is that I can't be sure the UTF-8 bytes in other such strings will start with '\xe6', so I need a way to distinguish the number from arbitrary UTF-8 bytes and then split them apart.

Is it possible to do that?


Solution

  • If it’s always a single digit, just take the first character:

    digit = s[0]
    

    Otherwise, you could use a regular expression to scan it:

    import re
    number = re.match(r'^\d+', s).group(0)
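
To get both the number and the rest of the string in one step, the regular expression can capture two groups. The following is a minimal sketch, not part of the original answer: the helper name `split_leading_number` is hypothetical, and it assumes you work on the unicode string before encoding it and that "number" means ASCII digits 0-9 only.

```python
# -*- coding: utf-8 -*-
import re

# Group 1: leading ASCII digits. Group 2: everything after them.
# re.DOTALL lets '.' also match newlines in the remainder.
_LEADING_DIGITS = re.compile(r'([0-9]+)(.*)', re.DOTALL)


def split_leading_number(text):
    """Hypothetical helper: return (number, rest).

    number is an empty string when text has no leading digits.
    """
    match = _LEADING_DIGITS.match(text)
    if match is None:
        return u'', text
    return match.group(1), match.group(2)


number, rest = split_leading_number(u'1\u670d-\u82f1\u96c4\u96c6\u7ed3')
print(number)  # 1
```

Splitting the decoded text rather than the encoded bytes means the first byte of the following character never matters. If you only have the byte string, decoding it first with `s.decode('utf8')` should let you use the same helper.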