python, split, words

How to extract words from a text using Python?


I need to extract the words and phrases within a text. For example, the text is:

Привет, hello, как дела? english word, еще одно русское слово, слово-1224, тест 4456

And the script should return the following:

Привет
как
дела
еще
одно
русское
слово
слово-1224

That is, I need to extract from the text all the words that begin with a Russian letter ([а-яА-Яё-]) and may contain digits and letters of the Russian alphabet. How can this be implemented?


Solution

  • It was a little trickier than I thought, as I have never worked with Cyrillic characters before. I believe this should do it:

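    import re

    # Set your input Unicode string here (this is the example text from the question).
    text = "Привет, hello, как дела? english word, еще одно русское слово, слово-1224, тест 4456"
    # The standard re module does not support \p{IsCyrillic}, so use explicit ranges.
    # A word starts with a Russian letter, then may contain Russian letters, digits, and hyphens.
    words = re.findall(r'[а-яА-ЯёЁ][а-яА-ЯёЁ0-9-]*', text)

    for word in words:
        print(word)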
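
If the text can also contain mixed tokens such as "abcслово" or "4456тест", a plain character-class match would pick up the Cyrillic tail of those tokens. Below is a minimal sketch of one way to guard against that, assuming Python 3 and assuming that "Russian letters" means а-я, А-Я plus ё/Ё. The function name extract_russian_words and the lookbehind are my own additions for illustration, not something the question asks for:

```python
import re

# Assumption: a "Russian word" starts with a Russian letter and continues with
# Russian letters, digits, or hyphens. ё/Ё are listed explicitly because they
# fall outside the а-я / А-Я ranges.
# The negative lookbehind (?<![\w-]) rejects matches that begin in the middle
# of a longer token, e.g. right after a Latin letter, a digit, or a hyphen.
RUSSIAN_WORD = re.compile(r"(?<![\w-])[а-яА-ЯёЁ][а-яА-ЯёЁ0-9-]*")


def extract_russian_words(text):
    """Return all tokens in text that begin with a Russian letter."""
    return RUSSIAN_WORD.findall(text)


sample = "Привет, hello, как дела? english word, еще одно русское слово, слово-1224, тест 4456"

for word in extract_russian_words(sample):
    print(word)
```

Note that with this pattern "тест" is returned as well, since it also begins with a Russian letter, and a word followed directly by a hyphen keeps that trailing hyphen. Tighten the pattern if that is not what you want.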