I want to split strings like "hiسلامaliعلی" into ["hi", "سلام", "ali", "علی"].
The initial string contains only English and Persian characters (with or without spaces), and I want to split it into runs of contiguous characters of the same language.
Is there an easy way to extract the contiguous English characters from the string and split off the remaining characters?
You can split on ASCII letters with re.split():
re.split(r'([a-zA-Z]+)', inputstring)
Demo with Python 3:
>>> import re
>>> inputstring = "hiسلامaliعلی"
>>> re.split(r'([a-zA-Z]+)', inputstring)
['', 'hi', 'سلام', 'ali', 'علی']
Extending this to the accented letters in the Latin-1 range (note that \xC0-\xFF also matches × and ÷, which are not letters):
re.split(r'([a-zA-Z\xC0-\xFF]+)', inputstring)
For Python 2, make sure you use unicode strings and prefix the regular expression with u:
re.split(ur'([a-zA-Z\xC0-\xFF]+)', inputstring)
In all cases, if the Latin text is at the start or end, an empty string is inserted as the string is split; you can remove these with:
result = [s for s in re.split(r'([a-zA-Z\xC0-\xFF]+)', inputstring) if s]
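Putting the pieces together, here is a minimal Python 3 sketch of the approach above wrapped in a function. The names `LATIN_RUN` and `split_by_script` are made up for illustration. Stripping whitespace from each piece is an assumption about how spaces between the English and Persian parts should be handled; drop the `strip()` calls if you want to keep them.

```python
import re

# Same Latin-1 pattern as above; the capturing group makes re.split()
# keep the Latin runs in the result instead of discarding them.
LATIN_RUN = re.compile(r'([a-zA-Z\xC0-\xFF]+)')

def split_by_script(text):
    """Split text into alternating Latin / non-Latin runs.

    Assumption: surrounding whitespace is not wanted, so each piece is
    stripped and pieces that end up empty are dropped.
    """
    pieces = (piece.strip() for piece in LATIN_RUN.split(text))
    return [piece for piece in pieces if piece]

print(split_by_script("hiسلامaliعلی"))  # ['hi', 'سلام', 'ali', 'علی']
print(split_by_script("hi سلام ali"))   # ['hi', 'سلام', 'ali']
```

Without the `strip()` calls, a space between "hi" and "سلام" would stay attached to the Persian piece, because the pattern only matches Latin letters and everything else falls into the non-Latin runs.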