python regex cyrillic

Regex for words that do not contain Cyrillic letters


I'd like to remove from a string every word that does not contain at least one Cyrillic letter (by "word" I mean a part of the string delimited by whitespace).

I've tried line = re.sub(' *^[^а-яА-Я]+ *', ' ', line), where [а-яА-Я] is the set of Cyrillic letters, but when processing the string

 des поместья, de la famille Buonaparte. Non, je vous préviens que si vous

it returns

поместья, de la famille Buonaparte. Non, je vous préviens que si vous

instead of just

поместья

Solution

  • One option is to match one or more characters that are neither in the ranges а-яА-Я nor whitespace, using [^а-яА-Я\s]+

    The negative lookarounds (?<!\S) and (?!\S) assert whitespace boundaries to the left and to the right, so the pattern only matches a whole whitespace-delimited word that has no Cyrillic letter in it.

    Replacing the matches with an empty string can leave runs of several spaces, which you would then have to collapse into a single space.

    The comma attached to a kept word is not matched, so it stays in the result. To drop it, you can use strip and pass the characters that you want to remove.


    For example:

    import re
    
    s = " des поместья, de la famille Buonaparte. Non, je vous préviens que si vous"
    pattern = r"(?<!\S)[^а-яА-Я\s]+(?!\S)"
    print(re.sub(pattern, "", s).strip(', '))
    

    Output

    поместья
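
As an alternative to substituting and then cleaning up the leftover spaces, the same filtering can be sketched by splitting on whitespace and keeping only the words that contain a Cyrillic letter. This is a minimal sketch, not part of the original answer: the helper name keep_cyrillic_words and its strip_chars parameter are hypothetical, and it assumes "Cyrillic letter" means the Russian alphabet. Note that ё and Ё lie outside the ranges а-я and А-Я, so they are added to the character class explicitly.

```python
import re

# Assumption: Russian alphabet only. ё/Ё are listed explicitly because
# they fall outside the а-я and А-Я code point ranges.
CYRILLIC = re.compile(r"[а-яА-ЯёЁ]")


def keep_cyrillic_words(text, strip_chars=",.;:!?"):
    """Keep whitespace-delimited words that contain a Cyrillic letter.

    Hypothetical helper: punctuation listed in strip_chars is trimmed from
    both ends of each kept word, and the words are joined by single spaces.
    """
    kept = (w.strip(strip_chars) for w in text.split() if CYRILLIC.search(w))
    return " ".join(kept)


s = " des поместья, de la famille Buonaparte. Non, je vous préviens que si vous"
print(keep_cyrillic_words(s))  # поместья
```

Because the words are re-joined with single spaces, no separate pass is needed to collapse double spaces when several Cyrillic words survive.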