python python-3.x regex unicode python-re

How do I match all Unicode lowercase characters in Python with a regular expression?


I am trying to write a regular expression that would match Unicode lowercase characters in Python 3. I'm using the re library. For example, re.findall(some_pattern, 'u∏ñKθ') should return ['u', 'ñ', 'θ'].

In Sublime Text, I could simply type [[:lower:]] to find these characters.

I'm aware that Python can match any Unicode letter with re.compile('[^\W\d_]'), but I specifically need to differentiate between uppercase and lowercase. I'm also aware that re.compile('[a-z]') would match any ASCII lowercase character, but my data is UTF-8 and includes lots of non-ASCII characters (I checked).

Is this possible with regular expressions in Python 3, or will I need to take an alternative approach? I know other ways to do it. I was just hoping to use regex.


Solution

  • If a third-party package is acceptable, you can use the regex package, which supports POSIX character classes such as [[:lower:]].

    >>> import regex
    >>> s = 'ABCabcÆæ'
    >>> m = regex.findall(r'[[:lower:]]', s)
    >>> m
    ['a', 'b', 'c', 'æ']
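
  • If you would rather stay with the standard library, one possible workaround is to build the character class yourself, since re has no [[:lower:]] or \p{...} support. The sketch below assumes that "lowercase" means the Unicode general category Ll, which may differ slightly from what [[:lower:]] matches in the regex package; find_lowercase is a hypothetical helper name, not part of any library.

```python
import re
import sys
import unicodedata

# Assumption: "lowercase" is taken to mean general category Ll
# (Letter, lowercase). Collect every such code point once, up front.
_LOWER_CHARS = ''.join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == 'Ll'
)

# re.escape guards against any character that is special inside [...].
LOWER_RE = re.compile('[' + re.escape(_LOWER_CHARS) + ']')


def find_lowercase(text):
    """Hypothetical helper: return every Ll character in text, in order."""
    return LOWER_RE.findall(text)


print(find_lowercase('ABCabcÆæ'))
```

    For the sample string above this prints the same list as the regex session, ['a', 'b', 'c', 'æ']. Building the class scans the whole code point range, so it is worth doing once at import time and reusing the compiled pattern.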