python regex regex-lookarounds

Regex Python: Keep first digits


The objective is to keep digits that appear at the start of a string, but remove digits that appear anywhere else.

For instance, only these numbers should be kept:

123456 AB
123456 GENERAL
123456 HOSPITAL

On the other hand, these numbers should be removed:

PROJECT 150000 SCHOLARSHIPS
SUMMERLAND 05 100 SCHOOL 100 ABC
ABC HOSPITAL 01 20 30 GENERAL
ABC HOSPITAL 01

I have crafted this regex, which comes very close to the desired behaviour when I replace its matches with an empty string:

(?<=\w\b )([0-9]*)

However, removing the digits leaves an additional space behind, because the space that precedes the digits is not removed:

123456 AB
123456 GENERAL
123456 HOSPITAL

PROJECT  SCHOLARSHIPS
SUMMERLAND   SCHOOL  ABC
ABC HOSPITAL    GENERAL
ABC HOSPITAL 
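The behaviour can be reproduced with a minimal sketch. The names `pattern`, `lines` and `cleaned` are illustrative and not part of the original post. The lookbehind only asserts that a word character and a space precede the digits; it does not consume them, so the spaces on both sides of each removed number stay in place.

```python
import re

# Lookbehind pattern from the question: it checks for a word character
# and a space before the digits, but does not consume either of them.
pattern = r"(?<=\w\b )([0-9]*)"

# A few of the sample lines from the question (illustrative selection).
lines = [
    "123456 AB",
    "PROJECT 150000 SCHOLARSHIPS",
    "ABC HOSPITAL 01",
]

# Replace every match with an empty string, as described above.
cleaned = [re.sub(pattern, "", line) for line in lines]

for line in cleaned:
    # repr() makes doubled and trailing spaces visible.
    print(repr(line))
```

Printing with `repr()` shows the doubled space between `PROJECT` and `SCHOLARSHIPS` and the trailing space after `ABC HOSPITAL`.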

How can I get rid of this space?


Solution

  • To keep the first digits in the string, you could also use a capturing group with an alternation instead of a lookbehind. Capture in a group what you want to keep, and match what you don't want to keep.

    ^([^\S\r\n]*\d+)|\d+[^\S\r\n]*
    
    • ^ Start of string (with re.MULTILINE, the start of each line)
    • ( Capture group 1 (what you want to keep)
      • [^\S\r\n]*\d+ Match optional whitespace chars except newlines, match 1+ digits
    • ) Close group
    • | Or
    • \d+[^\S\r\n]* Match 1+ digits followed by optional whitespace chars except newlines (What you want to remove)
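As a rough illustration of the breakdown above, the sketch below lists every match in a line and whether capture group 1 took part in it. The helper name `describe_matches` is hypothetical and only serves the demonstration. Group 1 is set only when the digits sit at the start, which is what allows a `\1` replacement to keep them and drop everything else.

```python
import re

# Pattern from the answer: group 1 holds leading digits (to keep);
# the second alternative matches any other digits plus trailing spaces (to remove).
pattern = re.compile(r"^([^\S\r\n]*\d+)|\d+[^\S\r\n]*")

def describe_matches(line):
    """Return (matched_text, kept) pairs, where kept is True if group 1 matched."""
    return [(m.group(0), m.group(1) is not None) for m in pattern.finditer(line)]

print(describe_matches("123456 AB"))
print(describe_matches("SUMMERLAND 05 100 SCHOOL 100 ABC"))
```

For a single line, `re.MULTILINE` is not needed, since `^` already matches at the start of the string.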

    Regex demo | Python demo

    For example

    import re
    result = re.sub(regex, r'\1', test_str, count=0, flags=re.MULTILINE)
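A self-contained version of the call might look like the sketch below. The contents of `test_str` are assumed here from the sample lines in the question; the original demo input is not shown. The flags are passed by keyword, which avoids confusing the positional `count` and `flags` arguments.

```python
import re

# Pattern from the answer above.
regex = r"^([^\S\r\n]*\d+)|\d+[^\S\r\n]*"

# Assumed test input, built from sample lines in the question.
test_str = (
    "123456 AB\n"
    "123456 GENERAL\n"
    "PROJECT 150000 SCHOLARSHIPS\n"
    "SUMMERLAND 05 100 SCHOOL 100 ABC\n"
    "ABC HOSPITAL 01 20 30 GENERAL\n"
)

# re.MULTILINE lets ^ match at the start of every line.
# Replacing with \1 keeps the leading digits and drops all other matches.
result = re.sub(regex, r"\1", test_str, flags=re.MULTILINE)
print(result)
```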
    

    Output

    123456 AB
    123456 GENERAL
    123456 HOSPITAL
    
    PROJECT SCHOLARSHIPS
    SUMMERLAND SCHOOL ABC
    ABC HOSPITAL GENERAL
    ABC HOSPITAL
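
One detail worth checking: when a line ends in digits, as in `ABC HOSPITAL 01`, the pattern removes the digits but not the space before them, so a trailing space remains. A minimal sketch of one way to handle this, assuming it is acceptable to trim trailing whitespace on every line (the helper name `strip_inner_digits` is hypothetical):

```python
import re

# Same pattern as in the answer, compiled with MULTILINE so ^ matches per line.
PATTERN = re.compile(r"^([^\S\r\n]*\d+)|\d+[^\S\r\n]*", re.MULTILINE)

def strip_inner_digits(text):
    """Remove non-leading digit runs, then trim trailing whitespace on each line."""
    replaced = PATTERN.sub(r"\1", text)
    return "\n".join(line.rstrip() for line in replaced.split("\n"))

print(repr(strip_inner_digits("ABC HOSPITAL 01\n123456 AB")))
```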