I'm trying to filter out everything except alphanumeric characters, Russian letters, line breaks, spaces, commas, dots, question marks, exclamation marks, slashes, #, @, colons and parentheses.
My regex so far: r"[^А-я\w\d\n ,.?!ё/@#:()]"
However, it does not match the following string: "πΎπππππ".
Why not, and how can I make it do so?
Edit: Forgot to mention that it works as expected at https://regexr.com/
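You may check the string at this link and you will see that the "πΎπππππ" string consists of characters belonging to the \p{L} (letter) category. Your regex starts with [^А-я\w\d, which means it matches any character other than Russian letters (ё, which you add a bit later, is covered; Ё is not), any Unicode letters and digits, and so on. It is "any" because in Python 3 \w matches, by default, all Unicode alphanumeric characters plus the underscore. The characters in your string are letters, so \w covers them and the negated class never matches them.
As for regexr.com: it uses the JavaScript engine by default, where \w only matches [A-Za-z0-9_], which is why the pattern appears to work there.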
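To see the effect of \w in isolation, here is a minimal sketch. The original string is not reproduced here, so `sample` is a hypothetical stand-in (U+1D43E, a mathematical italic letter), and the variable names are illustrative only. It applies the pattern from the question once in Python's default Unicode mode and once with `re.ASCII`, which narrows \w and \d to ASCII.

```python
import re
import unicodedata

# Hypothetical stand-in for the string in the question:
# U+1D43E MATHEMATICAL ITALIC CAPITAL K, a non-ASCII, non-Cyrillic letter.
sample = "\U0001D43E"

# The negated character class from the question.
question_pattern = r"[^А-я\w\d\n ,.?!ё/@#:()]"

# Unicode general category of the character ("Lu" is an uppercase letter).
category = unicodedata.category(sample)

# Default (Unicode) mode: \w covers the letter, so the negated class skips it.
unicode_match = re.search(question_pattern, sample)

# re.ASCII restricts \w and \d to ASCII, so the same class now matches it.
ascii_match = re.search(question_pattern, sample, re.ASCII)

print(category, unicode_match, ascii_match)
```

With `re.ASCII` the pattern behaves much like it does in an engine whose \w is ASCII-only, but listing the allowed ranges explicitly is usually easier to read than relying on a flag.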
It appears you only want to keep Russian and English letters (plus digits and the listed punctuation), so use the corresponding explicit ranges instead of \w:
r"[^А-ЯЁа-яёA-Za-z0-9\n ,.?!/@#:()]+"
It matches one or more characters other than:
А-ЯЁа-яё - Russian letters
A-Za-z - ASCII letters
0-9 - ASCII digits
\n ,.?!/@#:() - newline, space, comma, dot, question and exclamation marks, slash, at sign, hash, colon and round parentheses.
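A short sketch of how the explicit-range pattern could be used with `re.sub` to drop everything that is not on the allow-list. The names `CLEANUP` and `strip_disallowed` and the sample text are made up for illustration; the two `\U0001D4..` escapes are stand-ins for the kind of "fancy" Unicode letters in the question.

```python
import re

# Negated class with explicit ranges: anything NOT listed here is removed.
CLEANUP = re.compile(r"[^А-ЯЁа-яёA-Za-z0-9\n ,.?!/@#:()]+")


def strip_disallowed(text: str) -> str:
    """Remove every run of characters that is not on the allow-list."""
    return CLEANUP.sub("", text)


# Two mathematical italic letters and an ampersand are not on the allow-list.
sample = "Привет, \U0001D43E\U0001D44Eмир&!"
cleaned = strip_disallowed(sample)
print(cleaned)
```

Because the ranges are spelled out, the result no longer depends on how \w is defined in a given engine or mode.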