python regex regex-lookarounds

Regex negative lookahead implementations


I am trying to implement a negative lookahead for my task.

I need to put kgs into a negative lookahead after the numeric part, so that amounts followed by kgs are not matched.

So far I have tried this regex:

total\samount\s?\:?\s?[0-9\,\.]+\s(?!kgs)(?!\ kgs)

Given the text:

task1. total amount 5,887.99 kgs
task2. total amount 5,887.99kgs
task3. total amount 5,887.99 usd
task4. total amount 5,887.99usd

I want to match task3 and task4 but not task1 and task2.

So far I am able to reject task1 and task2 and match task3, but I fail to match task4.
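
The behaviour described above can be reproduced with a short script. This is only a sketch using Python's standard re module; the variable names (attempt, texts, matched) are made up for illustration. The mandatory \s after the number is what rules out task4, where usd follows the digits directly.

```python
import re

# The pattern from the question, unchanged.
attempt = re.compile(r'total\samount\s?\:?\s?[0-9\,\.]+\s(?!kgs)(?!\ kgs)')

texts = [
    'task1. total amount 5,887.99 kgs',
    'task2. total amount 5,887.99kgs',
    'task3. total amount 5,887.99 usd',
    'task4. total amount 5,887.99usd',
]

# Collect the lines in which the pattern finds a match.
matched = [s for s in texts if attempt.search(s)]
print(matched)  # ['task3. total amount 5,887.99 usd']
```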


Solution

  • You may emulate an atomic group, which Python re did not support before version 3.11.

    For that purpose, you may use

    total\s+amount\s*(?::\s*)?(?=(\d[\d,.]*))\1(?!\s*kgs)
    

    See the regex demo

    Details

    • total\s+amount - total, 1+ whitespaces, amount
    • \s* - 0+ whitespaces
    • (?::\s*)? - an optional group matching 1 or 0 occurrences of : and 0+ whitespaces
    • (?=(\d[\d,.]*)) - a positive lookahead that matches and captures into Group 1 a digit and then 0 or more digits, dots or commas
    • \1 - the value of capturing group #1 (no backtracking is allowed into a backreference, so the subsequent lookahead is triggered only once and, if it fails, the whole match fails)
    • (?!\s*kgs) - a negative lookahead that fails the match if there are 0+ whitespaces and then kgs immediately to the right of the current location.
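
    To see why the lookahead plus backreference trick is needed, here is a small comparison sketch (the names prefix, naive and emulated are illustrative only). With a plain greedy \d[\d,.]* the engine can give back the last digit, after which (?!\s*kgs) succeeds because the next character is 9 rather than whitespace or kgs.

```python
import re

prefix = r'total\s+amount\s*(?::\s*)?'

# Plain greedy number: backtracking may shorten the number by one character.
naive = re.compile(prefix + r'\d[\d,.]*(?!\s*kgs)')
# Emulated atomic group: the number is captured in a lookahead and
# consumed with \1, so it cannot be shortened afterwards.
emulated = re.compile(prefix + r'(?=(\d[\d,.]*))\1(?!\s*kgs)')

text = 'task1. total amount 5,887.99 kgs'

naive_match = naive.search(text)
emulated_match = emulated.search(text)

print(naive_match.group())  # total amount 5,887.9
print(emulated_match)       # None
```

    The naive pattern reports a false match that stops one digit early, while the emulated atomic version rejects the line as intended.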

    In Python, use

    pattern = r'total\s+amount\s*(?::\s*)?(?=(\d[\d,.]*))\1(?!\s*kgs)'
    

    NOTE: With the PyPI regex module, which supports atomic groups and possessive quantifiers, you may just use

    total\s+amount\s*(?::\s*)?\d[\d,.]*+(?!\s*kgs)
    #                                 ^^
    

    See the regex demo (PHP option is set since this will have the same behavior in Python code).

    The *+ (0 or more) quantifier is possessive: once the digits, commas and dots are matched, that part of the pattern is never retried, and the negative lookahead check is performed only once.
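
    As a side note, the standard re module accepts possessive quantifiers and atomic groups starting with Python 3.11, so on a recent interpreter the same idea may work without the third-party module. The sketch below assumes that version and guards the compilation accordingly; the names possessive, atomic and native_matches are illustrative.

```python
import re
import sys

texts = [
    'task1. total amount 5,887.99 kgs',
    'task2. total amount 5,887.99kgs',
    'task3. total amount 5,887.99 usd',
    'task4. total amount 5,887.99usd',
]

native_matches = []
# Older interpreters reject *+ and (?>...) at compile time, hence the guard.
if sys.version_info >= (3, 11):
    possessive = re.compile(r'total\s+amount\s*(?::\s*)?\d[\d,.]*+(?!\s*kgs)')
    atomic = re.compile(r'total\s+amount\s*(?::\s*)?(?>\d[\d,.]*)(?!\s*kgs)')
    for s in texts:
        p = possessive.search(s)
        a = atomic.search(s)
        # Record the line only when both variants agree on a match.
        if p and a and p.group() == a.group():
            native_matches.append((s, p.group()))

for s, found in native_matches:
    print(repr(found), 'matched in', repr(s))
```

    On Python 3.11 or later this should report matches for task3 and task4 only; on older versions the list stays empty and the backreference pattern above remains the option for re.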

    Python test online:

    import re
    import regex  # third-party module from PyPI (pip install regex)
    
    texts = ['task1. total amount 5,887.99 kgs','task2. total amount 5,887.99kgs','task3. total amount 5,887.99 usd','task4. total amount 5,887.99usd']
    re_rx = r'total\s+amount\s*(?::\s*)?(?=(\d[\d,.]*))\1(?!\s*kgs)'
    regex_rx = r'total\s+amount\s*(?::\s*)?\d[\d,.]*+(?!\s*kgs)'
    
    for s in texts:
        m_rx = re.search(re_rx, s)
        if m_rx:
            print("'", m_rx.group(), "' matched in '", s,"' with re pattern", sep="")
        m_regex = regex.search(regex_rx, s)
        if m_regex:
            print("'", m_regex.group(), "' matched in '", s,"' with regex pattern", sep="")
    

    Output:

    'total amount 5,887.99' matched in 'task3. total amount 5,887.99 usd' with re pattern
    'total amount 5,887.99' matched in 'task3. total amount 5,887.99 usd' with regex pattern
    'total amount 5,887.99' matched in 'task4. total amount 5,887.99usd' with re pattern
    'total amount 5,887.99' matched in 'task4. total amount 5,887.99usd' with regex pattern