Alternative to possessive quantifier in python


I am trying to match all occurrences of the string Article followed by a number (one or more digits) that is not followed by an opening parenthesis. In Sublime Text, I am using the following regex:

Article\s[0-9]++(?!\()

to search the following string:

Article 29
Article 30(1)

This does not match Article 30(1) (as I expect), but it does match Article 29 and Article 1.

When attempting to do the same in Python (3) using

import re
article_list = re.findall(r'Article\s[0-9]++(?!\()', "Article 30(1)")

I get an error, because I am using a (nested) possessive quantifier, which is not supported by Python's regex engine. Is there any way to match what I want it (not) to match in Python?


Solution

  • Python 3.11 Update

    Possessive quantifiers and atomic groups are now supported. See What’s New In Python 3.11:

    Atomic grouping ((?>...)) and possessive quantifiers (*+, ++, ?+, {m,n}+) are now supported in regular expressions. (Contributed by Jeffrey C. Jacobs and Serhiy Storchaka in bpo-433030.)

    That means re.findall(r'Article\s[0-9]++(?!\()', "Article 30(1)") now works as expected in Python 3.11 and later.

    In short: these constructs disallow backtracking into the quantified pattern.

    Here, in [0-9]++(?!\(), the digits are matched and consumed, and the negative lookahead is checked only once, right after the last digit consumed by [0-9]++. If there is a ( after that digit, the match fails at that position and nothing is returned. With the plain greedy [0-9]+(?!\(), the regex engine instead backtracks after matching the last digit and finding a ( after it: it moves back to just before the last digit matched by [0-9]+, rightly confirms that this digit is not a ( character, and returns a match with the last digit of the number cut off (Article 3 for Article 30(1)).
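The backtracking behaviour described above can be reproduced with a minimal sketch like the one below. It uses only the standard re and sys modules; the variable names are illustrative, and the possessive pattern is only compiled when the interpreter reports version 3.11 or later.

```python
import re
import sys

SAMPLE = "Article 30(1)"

# Greedy [0-9]+ can give back the trailing "0", so the lookahead is
# retried after "3" and succeeds there, truncating the number.
greedy = re.findall(r'Article\s[0-9]+(?!\()', SAMPLE)
print(greedy)  # ['Article 3']

# The possessive form is only valid syntax for re on Python 3.11+,
# so it is guarded by a version check.
if sys.version_info >= (3, 11):
    possessive = re.findall(r'Article\s[0-9]++(?!\()', SAMPLE)
    print(possessive)  # []
```

On older interpreters only the first result is printed, which is the truncated match the question is trying to avoid.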

    Legacy answer

    Before version 3.11, Python's re module did not support possessive quantifiers. You may consider using the regex module from PyPI instead, which supports this type of quantifier, or use one of the following workarounds.

    One option is to add a digit range to the lookahead, so that backtracking into the number also fails:

    Article\s[0-9]+(?![(0-9])
                        ^^^   
    


    Alternatively, use a word boundary:

    Article\s[0-9]+\b(?!\()
                    ^
    

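As a quick check of the two workarounds, the sketch below runs both patterns against the strings from the question with the standard re module. The constant names and the extra "Article 5a" input are assumptions added for illustration; they do not come from the original post.

```python
import re

# Workaround 1: forbid "(" and any digit right after the number.
DIGIT_LOOKAHEAD = r'Article\s[0-9]+(?![(0-9])'
# Workaround 2: require a word boundary right after the number.
WORD_BOUNDARY = r'Article\s[0-9]+\b(?!\()'

text = "Article 29\nArticle 30(1)\nArticle 1"

by_lookahead = re.findall(DIGIT_LOOKAHEAD, text)
by_boundary = re.findall(WORD_BOUNDARY, text)
print(by_lookahead)  # ['Article 29', 'Article 1']
print(by_boundary)   # ['Article 29', 'Article 1']

# The two patterns differ when a letter follows the number:
# there is no word boundary between "5" and "a".
letter_lookahead = re.findall(DIGIT_LOOKAHEAD, "Article 5a")
letter_boundary = re.findall(WORD_BOUNDARY, "Article 5a")
print(letter_lookahead)  # ['Article 5']
print(letter_boundary)   # []
```

Both patterns give the same result on the question's input. Which one fits better depends on whether a number directly followed by a letter or underscore should count as a match.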