python, regex, regex-lookarounds, regex-negation, regex-group

Regex: remove the text after a slash only when it is more than one word


How can I remove the text after a slash only when that text contains more than one word? Specifically, consider the following string:

    0      1     2        0       1      2   3   
 CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS

The slash and all the characters after it should be removed because there are 4 words (HOPITAL, CENTRALE, DE, SOINS) and the limit is just one. The result is then: CENTRAL CARE HOSPITAL

On the other hand, we have the following string:

   0     1     2    3  0
HAPPY SPRING BREAK 20/20

This time, 20 has to be kept because it is just one word (\b[A-Za-z0-9]+\b). The slash should then be replaced by a space. The result should look like the following: HAPPY SPRING BREAK 20 20

Suppose the following test set:

CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS
ELEMENTARY/INSTITUTION
FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO
HAPPY SPRING BREAK 20/20

The result should be the following:

CENTRAL CARE HOSPITAL
ELEMENTARY INSTITUTION
FOUNDATION INSTITUTION
HAPPY SPRING BREAK 20 20

Overall, keep the text after the slash only when it is a single word, and put a space where the slash was. Otherwise, remove the slash and everything after it.
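As a baseline to compare any regex against, the rule can be sketched without a regex at all. This is only an illustration: `clean_plain` is a hypothetical helper name, it assumes at most one slash matters (it splits at the first one), and it counts words by splitting on whitespace rather than on regex word boundaries.

```python
def clean_plain(s):
    """Hypothetical helper: apply the one-word rule with str.partition."""
    head, slash, tail = s.partition("/")
    if not slash:
        return s                  # no slash: leave the string unchanged
    if len(tail.split()) > 1:
        return head               # more than one word after the slash: drop it
    return head + " " + tail      # at most one word: keep it, slash becomes a space


tests = [
    "CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS",
    "ELEMENTARY/INSTITUTION",
    "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO",
    "HAPPY SPRING BREAK 20/20",
]
for t in tests:
    print(clean_plain(t))
```

On the four test strings this should produce the expected results listed above, which makes it a handy reference when checking a candidate pattern.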

I have tried this regex so far, but it is not working: (?:[\/])([A-Z0-9]*\b)(?!\b[A-Z]*)|[^\/]*$

Thanks


Solution

  • You may use

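    import re

    # A slash, optionally followed by two or more words running to the end of the string (group 1)
    rx = r'/(\w+(?:\W+\w+)+\W*$)?'
    strs = ['CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS','ELEMENTARY/INSTITUTION','FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO','HAPPY SPRING BREAK 20/20']
    for s in strs:
        # Group 1 matched: drop the slash and the tail; otherwise replace the slash with a space
        print(re.sub(rx, lambda x: "" if x.group(1) else " ", s))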
    

    Output:

    CENTRAL CARE HOSPITAL
    ELEMENTARY INSTITUTION
    FOUNDATION INSTITUTION
    HAPPY SPRING BREAK 20 20
    

    The regex /(\w+(?:\W+\w+)+\W*$)? matches:

    • / - a slash
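    • (\w+(?:\W+\w+)+\W*$)? - an optional capturing group #1 that matches
      • \w+ - 1+ word chars
      • (?:\W+\w+)+ - 1+ sequences of 1+ non-word chars followed by 1+ word chars (so the group needs at least two words)
      • \W* - zero or more non-word chars
      • $ - end of string.

    The lambda passed to re.sub returns an empty string when group #1 matched (the slash and a multi-word tail are removed) and a space otherwise (only the slash is replaced).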
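To see how the optional group drives the replacement, the pattern can be wrapped in a small function and the value of group 1 inspected for each input. This is a minimal sketch: the names `RX` and `clean` are chosen here for illustration and are not part of the answer above.

```python
import re

# Same pattern as in the answer: a slash, then optionally 2+ words up to the end of the string.
RX = re.compile(r'/(\w+(?:\W+\w+)+\W*$)?')


def clean(s):
    """Drop a multi-word tail after the slash; otherwise turn the slash into a space."""
    # m.group(1) is None when the optional group did not participate in the match.
    return RX.sub(lambda m: "" if m.group(1) else " ", s)


for s in ["ELEMENTARY/INSTITUTION",
          "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO"]:
    m = RX.search(s)
    print(repr(m.group(1)), "->", repr(clean(s)))
```

For a single word after the slash, group 1 does not participate and `m.group(1)` is `None`, so only the slash is replaced. Note that `\W+` also matches a slash, so with this pattern a tail such as `B/C` counts as two words and is removed.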