Tags: python, string, split, non-english

Python: split a string into runs of same-language characters


I want to split strings like "hiسلامaliعلی" into ["hi", "سلام", "ali", "علی"].

The initial string contains only English and Persian characters (with or without spaces), and I want to split it into contiguous runs of same-language characters.

Is there an easy way to extract the contiguous English characters from the string and split off the remaining characters?


Solution

  • You can split on ASCII letters with re.split():

    re.split(r'([a-zA-Z]+)', inputstring)
    

    Demo with Python 3:

    >>> import re
    >>> inputstring = "hiسلامaliعلی"
    >>> re.split(r'([a-zA-Z]+)', inputstring)
    ['', 'hi', 'سلام', 'ali', 'علی']
    

    Extending this to the accented letters of Latin-1 (note that the range \xC0-\xFF also covers the × and ÷ signs, which are not letters):

    re.split(r'([a-zA-Z\xC0-\xFF]+)', inputstring)
    

    For Python 2, make sure inputstring is a unicode string and prefix the regular expression with u (the ur prefix is not valid in Python 3):

    re.split(ur'([a-zA-Z\xC0-\xFF]+)', inputstring)
    

    In all cases, if the input starts or ends with Latin text, re.split() returns an empty string at that position; you can filter these out with:

    result = [s for s in re.split(r'([a-zA-Z\xC0-\xFF]+)', inputstring) if s]
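The pieces above can be combined into one small helper. This is only a sketch for Python 3: the names LATIN_RUN and split_by_script are made up for illustration, and the whitespace stripping is an added assumption to cover the "with or without space" case from the question.

```python
import re

# Hypothetical name; the character class is the one used in the answer above.
LATIN_RUN = re.compile(r'([a-zA-Z\xC0-\xFF]+)')


def split_by_script(text):
    """Split text into alternating runs of Latin and non-Latin characters."""
    # The capturing group makes re.split() keep the Latin runs in the result.
    parts = LATIN_RUN.split(text)
    # Assumption: surrounding spaces are not wanted, so strip each piece
    # and drop any piece that is empty afterwards.
    return [p.strip() for p in parts if p.strip()]


print(split_by_script("hiسلامaliعلی"))
print(split_by_script("hi سلام ali علی"))
```

Both calls should give the same four pieces as in the question. Spaces inside a run of Persian text are kept, so a phrase of several Persian words stays together as one item.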