Tags: python, regex, locale

How to find out which chars are defined as alphanumeric for a given locale


In Python's re module, the meaning of \w and the related character classes is affected by the re.LOCALE flag. The re documentation describes \w as follows:

\w

When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale.

How can we find out which characters are defined as alphanumeric for a given locale? Say we ran 'locale -a' to list the locales on the system and want this information for one of them. Any quick method would do: a Python snippet or one-liner, a shell command, or reference material somewhere.


Solution

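  • Use string.letters (Python 2 only; it was removed in Python 3). Its value is locale-dependent and is updated when locale.setlocale() is called. It contains letters only, so combine it with string.digits to get the full alphanumeric set.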

    Example:

    >>> import locale
    >>> import string
    >>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
    'en_US.UTF-8'
    >>> string.letters
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    >>> locale.setlocale(locale.LC_ALL, 'de_DE')
    'de_DE'
    >>> string.letters
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
    >>>
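
On Python 3, where string.letters is not available, a similar list can be built by testing every byte value against \w with re.LOCALE, which Python 3 accepts only for bytes patterns. The following is a minimal sketch, not a definitive method: the function name locale_word_bytes and its locale_name parameter are made up for illustration, and the byte-by-byte scan is only meaningful for single-byte (8-bit) encodings such as ISO-8859-1, not for multi-byte encodings like UTF-8.

```python
import locale
import re


def locale_word_bytes(locale_name=None):
    """Return the byte values that the word class matches under re.LOCALE.

    locale_name is a hypothetical parameter: a name as listed by
    `locale -a`, or None to probe the LC_CTYPE locale already in effect.
    """
    # Calling setlocale() without a second argument only queries the setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        if locale_name is not None:
            # Raises locale.Error if the locale is not installed.
            locale.setlocale(locale.LC_CTYPE, locale_name)
        # re.LOCALE is only allowed with bytes patterns in Python 3.
        word = re.compile(rb"\w", re.LOCALE)
        # Keep each single byte 0..255 that the locale treats as a word char.
        return bytes(b for b in range(256) if word.match(bytes([b])))
    finally:
        # Restore the previous locale so the caller is not affected.
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    # "C" is always available; substitute a name from `locale -a`.
    print(locale_word_bytes("C"))
```

The result is a bytes object; to see the actual characters, decode it with the encoding of the locale you probed (for example 'latin-1' for an ISO-8859-1 locale). The result includes the underscore and the digits because \w matches them; filter them out if only the letters are wanted.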