From a list of strings, I want to extract all words and add them to a new list. I managed to do this with pattern matching:
import re
p = re.compile('[a-z]+', re.IGNORECASE)
p.findall("02_Sektion_München_Gruppe_Süd")
Unfortunately, the strings contain language-specific characters (here, German umlauts), so the example above yields:
['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']
I want it to yield:
['Sektion', 'München', 'Gruppe', 'Süd']
I would be grateful for suggestions on how to solve this problem.
You may use
import re
p = re.compile(r'[^\W\d_]+')
print(p.findall("02_Sektion_München_Gruppe_Süd"))
# => ['Sektion', 'München', 'Gruppe', 'Süd']
See the Python 3 demo.
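The [^\W\d_]+ pattern matches one or more characters that are word characters (\w) but neither digits nor underscores, that is, letters only. In Python 3, \w is Unicode-aware by default for str patterns, so letters such as ü are included.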
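To see why the negated class works, it may help to compare it with the two simpler alternatives on the string from the question. This is only an illustrative sketch for Python 3; the variable names are made up for the comparison.

```python
import re

text = "02_Sektion_München_Gruppe_Süd"

# ASCII-only class from the question: words are split at the umlauts
ascii_letters = re.findall(r'[a-z]+', text, re.IGNORECASE)

# \w also matches digits and "_", so the whole string is one match
word_chars = re.findall(r'\w+', text)

# "not a non-word char, not a digit, not an underscore" leaves only letters
letters = re.findall(r'[^\W\d_]+', text)

print(ascii_letters)  # ['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']
print(word_chars)     # ['02_Sektion_München_Gruppe_Süd']
print(letters)        # ['Sektion', 'München', 'Gruppe', 'Süd']
```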
In Python 2.x you will have to add the re.UNICODE flag (re.U for short) to make it match Unicode letters, and the input should be a unicode string rather than a byte string:
p = re.compile(r'[^\W\d_]+', re.U)
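Since the question asks about a whole list of strings, here is a minimal Python 3 sketch of collecting the matches into one new list. The list names (names, words) and the second input string are hypothetical; only the first string comes from the question.

```python
import re

p = re.compile(r'[^\W\d_]+')

# Hypothetical input list; only the first entry is from the question
names = ["02_Sektion_München_Gruppe_Süd", "03_Sektion_Köln_Gruppe_Nord"]

words = []
for name in names:
    # findall returns a list of matches; extend adds them one by one
    words.extend(p.findall(name))

print(words)
# ['Sektion', 'München', 'Gruppe', 'Süd', 'Sektion', 'Köln', 'Gruppe', 'Nord']
```

Using extend rather than append keeps the result flat, one word per element, instead of a list of lists.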