Tags: python, regex, regex-lookarounds

Regex Python: Negative Lookahead delete/keep digits at the beginning


The goal is to keep cardinal and ordinal numbers at the beginning of the string as long as they come immediately before either the word PERFORMANCE or the word SCORE:

#These numbers are kept:
100 SCORE FOR STUDENT
80 PERFORMANCE FOR TEACHER

However, if the numbers are at the start and the following word is different, then they should be removed:

#These numbers are removed
10095TH 10097TH 179TH SCHOOL ANIVERSARY
11 12 10 SECONDARY LEVELS
100 100 100 100 SCHOOL AGREEMENT

My problem arises when several groups of digits, separated by spaces, come before the word PERFORMANCE or SCORE:

#All numbers should be kept
3 10 100 PERFORMANCE
001 10 12345 SCORE

I am applying the following regex, but its last part, (?!\s*\d*\s*\d*\s*(?:PERFORMANCE|SCORE)\b), is messy: it currently only handles up to 3 sets of numbers before PERFORMANCE or SCORE:

(?<=[A-Za-z]\b )([ 0-9]*(ST|[RN]D|TH)?\b)|^(([\d ]+(ST|[RN]D|TH)?)*\b)(?!\s*\d*\s*\d*\s*(?:PERFORMANCE|SCORE)\b)

The previous regex works for the following:

3 10 100 PERFORMANCE
001 10 12345 SCORE

But it will not work if I add an additional set of digits:

3 10 100 1 PERFORMANCE
001 10 1 12345 SCORE

How can I generalize this rule so that it covers any number of digit sets?

Thanks


Solution

  • Try the following:

    ^(?:\d+(?:ST|[RN]D|TH)?\s)+(?=[^\d]+$)(?!PERFORMANCE|SCORE)
    
    ^                       anchor to beginning
    (?:                     start non-capturing group
        \d+                 match one or more digits
        (?:ST|[RN]D|TH)?    optionally followed by one of your approved suffixes
        \s                  then a whitespace
    )+                      one or more times
    (?=[^\d]+$)             assert that the rest of the line is number-free (stops the regex from backtracking and giving up the last number)
    (?!PERFORMANCE|SCORE)   assert that the following characters are NOT 'PERFORMANCE' or 'SCORE'
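
As a quick check, the pattern above can be run against the sample lines from the question. This is only a minimal sketch: the helper name `strip_leading_numbers` and the choice to process one line at a time with `re.sub` are assumptions made for illustration, not part of the original answer.

```python
import re

# Pattern from the answer above: match a run of leading numbers (with optional
# ordinal suffix) only when the rest of the line has no digits and does not
# start with PERFORMANCE or SCORE.
LEADING_NUMBERS = re.compile(
    r"^(?:\d+(?:ST|[RN]D|TH)?\s)+(?=[^\d]+$)(?!PERFORMANCE|SCORE)"
)


def strip_leading_numbers(line: str) -> str:
    """Hypothetical helper: remove the leading numbers when the pattern matches."""
    return LEADING_NUMBERS.sub("", line)


samples = [
    "100 SCORE FOR STUDENT",                    # kept
    "80 PERFORMANCE FOR TEACHER",               # kept
    "10095TH 10097TH 179TH SCHOOL ANIVERSARY",  # numbers removed
    "11 12 10 SECONDARY LEVELS",                # numbers removed
    "100 100 100 100 SCHOOL AGREEMENT",         # numbers removed
    "3 10 100 1 PERFORMANCE",                   # kept, any count of digit sets
    "001 10 1 12345 SCORE",                     # kept, any count of digit sets
]

for sample in samples:
    print(f"{sample!r} -> {strip_leading_numbers(sample)!r}")
```

Two caveats seem worth keeping in mind. Because of the `(?=[^\d]+$)` assertion, a line that contains digits anywhere after the leading numbers (for example `11 12 SCHOOL 5`) is left untouched. And since the final lookahead has no `\b`, a word that merely starts with `SCORE` or `PERFORMANCE` (such as `SCOREBOARD`) also protects the numbers; adding `\b` after the alternation would tighten that if needed.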