Can you please help me understand the following line of code:
import re
a= re.findall('[А-Яа-я-\s]+', string)
I am a bit confused about the pattern that is being searched for in the string. In particular, I think a match should start with А and end with any string in between А and я, separated by - and a space, but what does the second term, Яа, stand for?
[ ] any of the characters in here
А-Я any character between А and Я, inclusive
а-я any character between а and я, inclusive
- the literal character - (its placement here is ambiguous; a literal hyphen should only be at the very start or end of the class)
\s any whitespace character
+ at least one of the preceding class of characters
[А-Яа-я-\s]+ one or more consecutive characters, each of which is a letter between А and Я (uppercase or lowercase), a dash, or whitespace
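As a rough illustration of the breakdown above, here is a small sketch. The sample string is made up (the question does not show its input), and the pattern is written as a raw string with the hyphen moved to the end of the class, which should behave the same as the original while avoiding the ambiguity.

```python
import re

# Hypothetical sample text; the original question does not show its input.
text = "Hello Привет-мир 123 как дела"

# Raw string so that \s reaches the regex engine unchanged; the hyphen is
# placed last in the class so it is unambiguously a literal hyphen.
pattern = r'[А-Яа-я\s-]+'

# Each match is a maximal run of Cyrillic letters, hyphens, and whitespace.
matches = re.findall(pattern, text)
print(matches)  # [' Привет-мир ', ' как дела']
```

Note that the spaces next to the Cyrillic words are part of the matches, because whitespace is itself a member of the class.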
The [] is called a "character class" in regex; it basically says "any one of the characters inside here is valid". The + then means "at least one occurrence of the preceding character or class".
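To see what the + contributes, a minimal sketch comparing a class with and without it may help. The sample string and variable names are assumptions chosen only for illustration.

```python
import re

# Hypothetical input, chosen only for illustration.
sample = "да-да нет"

# Without +, every match is exactly one character from the class.
single = re.findall(r'[а-я]', sample)

# With +, consecutive characters from the class are grouped into one match.
runs = re.findall(r'[а-я]+', sample)

print(single)  # ['д', 'а', 'д', 'а', 'н', 'е', 'т']
print(runs)    # ['да', 'да', 'нет']
```

The hyphen and the space split the runs here because this smaller class does not include them.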
Python has a Regular Expressions HowTo that you might find useful to read through.