python python-3.x regex unicode python-re

How do I match all unicode lowercase characters in Python with a regular expression?

I am trying to write a regular expression that would match Unicode lowercase characters in Python 3. I'm using the re library. For example, re.findall(some_pattern, 'u∏ñKθ') should return ['u', 'ñ', 'θ'].

In Sublime Text, I could simply type [[:lower:]] to find these characters.

I'm aware that Python can match on any Unicode character with re.compile('[^\W\d_]'), but I specifically need to differentiate between uppercase and lowercase. I'm also aware that re.compile('[a-z]') would match any ASCII lowercase character, but my data is UTF-8, and it includes lots of non-ASCII characters—I checked.

Is this possible with regular expressions in Python 3, or will I need to take an alternative approach? I know other ways to do it. I was just hoping to use regex.

Solution

You can use the regex package if using a third party package is acceptable.

>>> import regex
>>> s = 'ABCabcÆæ'
>>> m = regex.findall(r'[[:lower:]]', s)
>>> m
['a', 'b', 'c', 'æ']