Split a string into a list by a set of strings


I am working with words written in the Uzbek language, which has the following letters:

alphabet = ["a", "b", "c", "d", "e", "f", "g", "g'", "h", "i", 
    "j", "k", "l", "m", "n", "ng", "o", "o'", "p", "q", "r", 
    "s", "sh", "t", "u", "v", "x", "y", "z"]

As you can see, some letters consist of more than one character, such as o', g', sh and ng. How can I split a word in this language into a list of Uzbek letters? For example, the word "o'zbek" should be split into ["o'", "z", "b", "e", "k"].

If I do the following:

word = "o'zbek"
letters = list(word)

It results in:

['o', "'", 'z', 'b', 'e', 'k']

which is incorrect because o and ' are not kept together as one letter.

I also tried using regex like this:

import re
expression = "|".join(alphabet)
re.split(expression, word)

But it results in:

['', "'", '', '', '', '']

Solution

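  • To give priority to the multi-character letters, first sort the alphabet by length, longest first. A regex alternation tries its alternatives from left to right and stops at the first one that matches, so "o'" has to come before "o". Then join the sorted letters with "|" as you did, and use re.findall instead of re.split: findall returns the matched letters, whereas split returns the text between the matches. This gives the list of letters: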

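    import re
    
    # Longest letters first, so "o'" is tried before "o" and "sh" before "s".
    sorted_alphabet = sorted(alphabet, key=len, reverse=True)
    # re.escape is a precaution in case a letter contains a regex metacharacter.
    regex = re.compile("|".join(map(re.escape, sorted_alphabet)))
    
    def split_word(word):
        # findall returns every non-overlapping match, scanning left to right.
        return regex.findall(word)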
    

    Usage:

    >>> split_word("o'zbek")
    ["o'", 'z', 'b', 'e', 'k']
    
    >>> split_word("asha")
    ['a', 'sh', 'a']