I am dealing with words written in the Uzbek language. The language has the following letters:
alphabet = ["a", "b", "c", "d", "e", "f", "g", "g'", "h", "i",
"j", "k", "l", "m", "n", "ng", "o", "o'", "p", "q", "r",
"s", "sh", "t", "u", "v", "x", "y", "z"]
As you can see, there are letters with multiple characters like o', g' and sh. How can I split a word in this language into a list of Uzbek letters? For example, splitting the word "o'zbek" into ["o'", "z", "b", "e", "k"].
If I do the following:
word = "o'zbek"
letters = list(word)
It results in:
['o', "'", 'z', 'b', 'e', 'k']
which is incorrect, as o and ' are not kept together.
I also tried using regex like this:
import re
expression = "|".join(alphabet)
re.split(expression, word)
But it results in:
['', "'", '', '', '', '']
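To give priority to the multi-character letters, first sort the alphabet by letter length, longest first. Regex alternation tries the alternatives from left to right and takes the first one that matches, so with the longer letters listed first, o' is tried before o. Then join the sorted alphabet with "|" as you did, and re.findall returns the list of letters: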
import re
sorted_alphabet = sorted(alphabet, key=len, reverse=True)
regex = re.compile("|".join(sorted_alphabet))
def split_word(word):
    return re.findall(regex, word)
Usage:
>>> split_word("o'zbek")
["o'", 'z', 'b', 'e', 'k']
>>> split_word("asha")
['a', 'sh', 'a']
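For reference, here is a self-contained sketch of the same idea that can be run as is. It also shows what happens without the sort, to illustrate why the ordering matters. The use of re.escape and the names pattern and naive are additions for this sketch, not part of the code above; escaping is only a precaution in case a letter ever contains a regex metacharacter.

```python
import re

alphabet = ["a", "b", "c", "d", "e", "f", "g", "g'", "h", "i",
            "j", "k", "l", "m", "n", "ng", "o", "o'", "p", "q", "r",
            "s", "sh", "t", "u", "v", "x", "y", "z"]

# Longest letters first, so "o'" is tried before "o" at each position.
sorted_alphabet = sorted(alphabet, key=len, reverse=True)

# re.escape is a precaution added in this sketch; none of the current
# letters contain regex metacharacters.
pattern = re.compile("|".join(re.escape(letter) for letter in sorted_alphabet))

def split_word(word):
    """Return the Uzbek letters of `word`, preferring multi-character letters."""
    return pattern.findall(word)

print(split_word("o'zbek"))   # ["o'", 'z', 'b', 'e', 'k']
print(split_word("g'isht"))   # ["g'", 'i', 'sh', 't']

# Without sorting, "o" comes before "o'" in the alternation, so "o" wins
# and the apostrophe is silently skipped.
naive = re.findall("|".join(alphabet), "o'zbek")
print(naive)                  # ['o', 'z', 'b', 'e', 'k']
```

Note that this approach always prefers the longer letter, so a word in which n and g (or s and h) are meant as two separate letters would need extra handling. Characters that are not in the alphabet are skipped by re.findall rather than reported.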