python regex pattern-matching language-specifications

Python pattern matching with language-specific characters

From a list of strings, I want to extract all words and save extend them to a new list. I was successful to do so using pattern matching in the form of:

import re
p = re.compile('[a-z]+', re.IGNORECASE)
p.findall("02_Sektion_München_Gruppe_Süd")

Unfortunately, the language contains language-specific characters, so that strings in the form of the given example yields:

['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']

I want it to yield:

['Sektion', 'München', 'Gruppe', 'Süd']

I am grateful for suggestions how to solve this problem.

Solution

You may use

import re
p = re.compile(r'[^\W\d_]+')
print(p.findall("02_Sektion_München_Gruppe_Süd"))
# => ['Sektion', 'München', 'Gruppe', 'Süd']

See the Python 3 demo.

The [^\W\d_]+ pattern matches any 1+ chars that are not non-word, digits and _, that is, that are only letters.

In Python 2.x you will have to add re.UNICODE flag to make it match Unicode letters:

p = re.compile(r'[^\W\d_]+', re.U)