Keep only alphabetic characters (multilingual) in a string


On Stack Overflow there are many answers about how to keep only the alphabetic characters of a string; the most commonly accepted one is the well-known regex '[^a-zA-Z]'. But that answer is wrong for most of the world, because it assumes everybody writes only in English. I considered downvoting all of those answers, but decided it would be more constructive to ask the question again, because I can't find the answer.

Is there an easy (or not so easy) way in Python to keep only the alphabetic characters of a string that works for all languages? Perhaps a library that does what XRegExp does in JavaScript. By all languages I mean English, but also French, Russian, Chinese, Greek, etc.


Solution

  • [^\W\d_]

    With Python 3, or with the re.UNICODE flag in Python 2, you could use [^\W\d_].

    \W: If UNICODE is set, this will match anything other than [0-9_] plus characters classified as alphanumeric in the Unicode character properties database.

    So [^\W\d_] matches any character that is not a non-word character (that is, it is a letter, a digit, or an underscore) and that is also neither a digit nor an underscore. In other words, it matches any alphabetic character. :)

    >>> import re
    >>> re.findall(r"[^\W\d_]", "jüste Ä tösté 1234 ßÜ א д", re.UNICODE)
    ['j', 'ü', 's', 't', 'e', 'Ä', 't', 'ö', 's', 't', 'é', 'ß', 'Ü', 'א', 'д']
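
The question asks for the filtered string rather than a list of characters. A minimal sketch, assuming Python 3 (where str patterns are Unicode-aware by default): instead of collecting matches, delete everything matched by the complementary class [\W\d_]. The name keep_letters is only illustrative.

```python
import re

# Complement of [^\W\d_]: non-word characters, digits and underscores.
_NOT_A_LETTER = re.compile(r"[\W\d_]+")


def keep_letters(text):
    """Return text with everything except alphabetic characters removed."""
    return _NOT_A_LETTER.sub("", text)


print(keep_letters("jüste Ä tösté 1234 ßÜ א д"))
```

Compiling the pattern once is a small convenience if the function is called on many strings.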
    

    Remove digits first, then look for "\w"

    To avoid this convoluted logic, you could also remove digits and underscores first, and then look for the remaining word characters with \w:

    >>> without_digit = re.sub(r"[\d_]", "", "jüste Ä tösté 1234 ßÜ א д", flags=re.UNICODE)
    >>> re.findall(r"\w", without_digit, re.UNICODE)
    ['j', 'ü', 's', 't', 'e', 'Ä', 't', 'ö', 's', 't', 'é', 'ß', 'Ü', 'א', 'д']
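
If a regular expression is not required, the same filtering can be sketched in a single pass with str.isalpha(), which in Python 3 is true for characters in the Unicode letter categories. This is an alternative technique, not the two-step approach above, and only_alpha is an illustrative name.

```python
def only_alpha(text):
    """Keep only the characters for which str.isalpha() is true."""
    return "".join(ch for ch in text if ch.isalpha())


print(only_alpha("jüste Ä tösté 1234 ßÜ א д"))
```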
    

    regex module

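    The third-party regex module could also help, since it understands \p{L} (any Unicode letter) and set differences such as [\w--\d_] (set operations need its VERSION1 behaviour, for example via the (?V1) inline flag).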

    This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.

    >>> import regex as re
    >>> re.findall(r"\p{L}", "jüste Ä tösté 1234 ßÜ א д", re.UNICODE)
    ['j', 'ü', 's', 't', 'e', 'Ä', 't', 'ö', 's', 't', 'é', 'ß', 'Ü', 'א', 'д']
    

    (Tested with Anaconda Python 3.6)
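
\p{L} selects code points whose Unicode general category starts with L. For readers who cannot install the regex package, here is a rough standard-library sketch of the same idea; it is not how the regex module is implemented. The helper name keep_categories and its prefixes parameter are hypothetical, and the NFC normalization step is an added assumption, so that a base letter followed by a combining accent is recomposed where possible before filtering.

```python
import unicodedata


def keep_categories(text, prefixes=("L",)):
    """Keep characters whose Unicode general category starts with a prefix.

    The default ("L",) keeps letters only, similar in spirit to \\p{L}.
    """
    # Assumption: compose "e" + combining acute into a single letter first.
    normalized = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in normalized
        if unicodedata.category(ch).startswith(prefixes)
    )


print(keep_categories("jüste Ä tösté 1234 ßÜ א д"))
```

Passing prefixes=("L", "M") would also keep combining marks that have no precomposed form, and ("L", "N") would keep numeric characters as well.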