python regex python-re

Why does my regex match non-ASCII characters?


I'm trying to filter out everything except alphanumeric characters, Russian letters, line breaks, spaces, commas, dots, question marks, exclamation marks, slashes, #, @, colons and parentheses.

My regex so far: r"[^А-я\w\d\n ,.?!ё/@#:()]"

However, it does not match the following string: "𝕾𝖍𝖎𝖗𝖔𝖓". Why not, and how can I make it do so?

Edit: Forgot to mention that it works as expected at https://regexr.com/


Solution

  • If you inspect the code points of "𝕾𝖍𝖎𝖗𝖔𝖓", you will see that the string consists of characters belonging to the \p{L} (letter) category. Your regex starts with [^А-я\w\d, which means it matches any character except Russian letters (the А-я range does not cover ё, which you add a bit later, or Ё, which you never add) and except anything that \w matches. In Python 3, \w by default matches any Unicode alphanumeric character plus the underscore, so these letters are excluded from the match as well.
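A minimal sketch to check this yourself, assuming the sample string is made of mathematical bold Fraktur letters (written here as escapes so the code survives copy and paste). The variable names are illustrative only.

```python
import re
import unicodedata

# Sample string, assumed to be the six mathematical bold Fraktur letters
# U+1D57E U+1D58D U+1D58E U+1D597 U+1D594 U+1D593.
sample = "\U0001D57E\U0001D58D\U0001D58E\U0001D597\U0001D594\U0001D593"

# General category of each character: Lu/Ll means "letter".
categories = [unicodedata.category(ch) for ch in sample]

# For str patterns, \w is Unicode-aware by default, so it matches every character.
unicode_hits = re.findall(r"\w", sample)

# With re.ASCII, \w is limited to [a-zA-Z0-9_] and matches none of them.
ascii_hits = re.findall(r"\w", sample, flags=re.ASCII)

print(categories)
print(len(unicode_hits), len(ascii_hits))
```

Because \w sits inside a negated class, every character it matches is kept rather than removed.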

    It appears you only want to keep Russian and English letters (plus digits and the listed punctuation) and remove everything else, so use the corresponding explicit ranges instead of \w:

    r"[^А-ЯЁа-яёA-Za-z0-9\n ,.?!/@#:()]+"
    

    It matches one or more chars other than:

    • А-ЯЁа-яё - Russian letters
    • A-Za-z - ASCII letters
    • 0-9 - ASCII digits
    • \n ,.?!/@#:() - newline, space, comma, dot, question and exclamation marks, slash, at sign, hash, colon and round parentheses.
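The pattern above can be tried out with a short sketch like the following. The helper name clean and the test sentence are made up for illustration, and the Fraktur string is again written as escapes.

```python
import re

# Pattern from the question: \w keeps every Unicode letter, so Fraktur survives.
ORIGINAL = re.compile(r"[^А-я\w\d\n ,.?!ё/@#:()]")

# Pattern from the answer: explicit whitelist, negated, matching runs of
# one or more disallowed characters.
DISALLOWED = re.compile(r"[^А-ЯЁа-яёA-Za-z0-9\n ,.?!/@#:()]+")


def clean(text: str) -> str:
    """Hypothetical helper: drop every character outside the whitelist."""
    return DISALLOWED.sub("", text)


# Assumed sample: mathematical bold Fraktur letters spelling the string above.
fraktur = "\U0001D57E\U0001D58D\U0001D58E\U0001D597\U0001D594\U0001D593"
message = f"Привет, {fraktur}! Ёлка #1 @ (test)"

print(ORIGINAL.sub("", message) == message)  # True: nothing is removed
print(clean(message))                        # Привет, ! Ёлка #1 @ (test)
```

Passing flags=re.ASCII to the original pattern would also stop \w from matching the Fraktur letters, but it would stop \w from covering Russian letters too, so spelling out the ranges is the more predictable choice here.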