Python pattern matching with language-specific characters


From a list of strings, I want to extract all words and append them to a new list. I managed to do this with a regular expression of the form:

import re
p = re.compile('[a-z]+', re.IGNORECASE)
p.findall("02_Sektion_München_Gruppe_Süd")

Unfortunately, the strings contain language-specific characters (here, German umlauts), so the example above yields:

['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']

I want it to yield:

['Sektion', 'München', 'Gruppe', 'Süd']

I would be grateful for suggestions on how to solve this problem.


Solution

  • You may use

    import re
    p = re.compile(r'[^\W\d_]+')
    print(p.findall("02_Sektion_München_Gruppe_Süd"))
    # => ['Sektion', 'München', 'Gruppe', 'Süd']
    

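    The [^\W\d_]+ pattern matches one or more characters that are not non-word characters (\W), not digits (\d), and not underscores (_). Since \w covers letters, digits, and the underscore, excluding the last two leaves essentially only letters, including non-ASCII ones such as ü.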

    In Python 2.x, you will have to add the re.UNICODE flag so that the pattern matches Unicode letters:

    p = re.compile(r'[^\W\d_]+', re.U)
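
Since the question asks for the words from a whole list of strings to be collected in a new list, here is a minimal Python 3 sketch of how the pattern above could be applied for that. The helper name extract_words and the second sample string are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer: runs of characters that are word characters
# but neither digits nor underscores.
LETTERS = re.compile(r'[^\W\d_]+')

def extract_words(strings):
    """Collect every run of letters from each string into one flat list.

    Hypothetical helper, not part of the original answer.
    """
    words = []
    for s in strings:
        # findall returns a list of matches; extend keeps the result flat
        words.extend(LETTERS.findall(s))
    return words

# The second string is an invented example.
samples = ["02_Sektion_München_Gruppe_Süd", "03_Straße_Nord"]
print(extract_words(samples))
# => ['Sektion', 'München', 'Gruppe', 'Süd', 'Straße', 'Nord']
```

Using extend rather than append keeps the result a flat list of words instead of a list of lists.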