python regex unicode jython

Why is "\p{L}" not working in this regex?


OS: Windows 7. Jython 2.7.0 "final release".

for token in sorted_cased.keys():
    freq = sorted_cased[ token ]
    if freq > 1:
        print( 'token |%s| unicode? %s' % ( token, isinstance( token, unicode ), ) )
        if re.search( ur'\p{L}+', token ):
            print( '  # cased token |%s| freq %d' % ( token, freq, ))

sorted_cased is a dict mapping each token to its frequency of occurrence. Here I'm trying to pick out the words (tokens made of Unicode letters) which occur with frequency > 1. (NB I was using re.match rather than re.search, but search should detect even one such \p{L} in the token.)

sample output:

token |Management| unicode? True
token |n| unicode? True
token |identifiés| unicode? True
token |décrites| unicode? True
token |agissant| unicode? True
token |tout| unicode? True
token |sociétés| unicode? True

None of these tokens is recognised as containing even a single \p{L}. I've tried all sorts of permutations: double quotes, adding flags=re.UNICODE, etc.

Later: I have been asked to explain why this cannot be classed as a duplicate of "How to implement \p{L} in python regex". It CAN, but... the answers to that other question do not draw attention to the need to use the REGEX MODULE (old version? very new version? NB they are different) as opposed to the RE MODULE. For the sake of the hair follicles and sanity of future people who come up against this one, I request that the present paragraph be allowed to remain, even if the question is "duped".

Also, my attempt to install the PyPI regex module FAILED UNDER JYTHON (using pip). It is probably better to use java.util.regex.


Solution

  • If you have access to Java's java.util.regex (as you do under Jython), the best option is to use its built-in \p{L} class.

    Python's built-in re module (including the one in Jython) does not support \p{L} or other Unicode category classes, nor does it support the POSIX character classes.

    One alternative is to restrict the \w class with a lookahead, as in (?![\d_])\w, and use the UNICODE flag. If UNICODE is set, \w matches the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database, and the lookahead then excludes the digits and the underscore. This alternative has one flaw: the construct cannot be used inside a character class.

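    Another idea is to use [^\W\d_] (with the re.U flag), which matches any character that is not a non-word character (\W), not a digit (\d) and not an underscore. It will effectively match any Unicode letter, and it can be combined with other items inside the same character class.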
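
The two re-only alternatives above can be sketched roughly as follows. This is a minimal illustration that assumes only the standard re module; the names LETTERS, LETTERS_LOOKAHEAD and has_letter are invented for the example, and the sample tokens are borrowed from the question's output (written with \u escapes so the snippet does not depend on the source file encoding).

```python
import re

# Character-class form: word characters minus digits and underscore.
LETTERS = re.compile(r'[^\W\d_]+', re.UNICODE)

# Lookahead form: each \w is accepted only if it is not a digit or underscore.
# Unlike the form above, this cannot be placed inside a [...] class.
LETTERS_LOOKAHEAD = re.compile(r'(?:(?![\d_])\w)+', re.UNICODE)


def has_letter(token):
    """Return True if the token contains at least one letter-like character."""
    return LETTERS.search(token) is not None


# u'identifi\u00e9s' is "identifiés", u'soci\u00e9t\u00e9s' is "sociétés".
tokens = [u'Management', u'n', u'identifi\u00e9s', u'soci\u00e9t\u00e9s', u'2024', u'__']
for token in tokens:
    print('%s -> %s' % (repr(token), has_letter(token)))
```

Note that in CPython \w also covers some numeric characters that \d does not (superscript digits, for example), so both patterns are a close approximation to \p{L} rather than an exact equivalent. Where exact Unicode categories matter, java.util.regex remains the safer choice under Jython.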