Search code examples
pythonregexpython-re

get all occurences from a string based on find pattern in python


Suppose I have a string like this:

exp = 'CASE WHEN  "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"=\'CPU\'  THEN   \'YES\'  WHEN  "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"=\'RAM\'  THEN   \'YES\' ELSE  \'NO\' END' 
exp2 = 'CASE WHEN  ("Expressions"."ORDER_ITEMS"."QUANTITY"*"Expressions"."ORDER_ITEMS"."UNIT_PRICE")>0  THEN  ("Expressions"."ORDER_ITEMS"."QUANTITY"* "Expressions"."ORDER_ITEMS"."UNIT_PRICE") ELSE ("Expressions"."ORDER_ITEMS"."QUANTITY"+ "Expressions"."ORDER_ITEMS"."UNIT_PRICE")   END '

I want to return the all occurrences of WHEN and THEN along with the texts of that.

This is the expected output of exp1

['WHEN  "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"=\'CPU\'  THEN   \'YES\'','WHEN  "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"=\'RAM\'  THEN   \'YES\'']

This is the expected output of exp2

['WHEN  ("Expressions"."ORDER_ITEMS"."QUANTITY"*"Expressions"."ORDER_ITEMS"."UNIT_PRICE")>0  THEN  ("Expressions"."ORDER_ITEMS"."QUANTITY"* "Expressions"."ORDER_ITEMS"."UNIT_PRICE")']

What I have tried is this:

res = re.findall(r'\s*(WHEN|When|when)+\s*(.*)\s*(THEN|Then|then)+\s*')

But the resulting list shows this output in my case

['(WHEN  "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"=\'CPU\'  THEN   \'YES\'  WHEN  "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"=\'RAM\'  THEN)']

Solution

  • Try:

    WHEN (?:(?! +(?:WHEN|ELSE)).)* # with flags=re.I
    
    1. WHEN - Matches 'WHEN '
    2. (?:(?! +(?:WHEN|ELSE)).) - Uses negative lookahead and states that as long as the current position does not match one or more space characters followed by either 'WHEN' or 'ELSE', then match one more character.

    See Regex Demo

    import re
    
    cases = [
        'CASE WHEN  "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"=\'CPU\'  THEN   \'YES\'  WHEN  "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"=\'RAM\'  THEN   \'YES\' ELSE  \'NO\' END',
        'CASE WHEN  ("Expressions"."ORDER_ITEMS"."QUANTITY"*"Expressions"."ORDER_ITEMS"."UNIT_PRICE")>0  THEN  ("Expressions"."ORDER_ITEMS"."QUANTITY"* "Expressions"."ORDER_ITEMS"."UNIT_PRICE") ELSE ("Expressions"."ORDER_ITEMS"."QUANTITY"+ "Expressions"."ORDER_ITEMS"."UNIT_PRICE")   END '
    ]
    
    for case in cases:
        res = re.findall(r'WHEN (?:(?! +(?:WHEN|ELSE)).)*', case, flags=re.I)
        print(res)
    

    Prints:

    ['WHEN  "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"=\'CPU\'  THEN   \'YES\'', 'WHEN  "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"=\'RAM\'  THEN   \'YES\'']
    ['WHEN  ("Expressions"."ORDER_ITEMS"."QUANTITY"*"Expressions"."ORDER_ITEMS"."UNIT_PRICE")>0  THEN  ("Expressions"."ORDER_ITEMS"."QUANTITY"* "Expressions"."ORDER_ITEMS"."UNIT_PRICE")']
    

    Update

    If you want to group the WHEN and ELSE parts (stripped of leading and trailing spaces), then use the following regex:

    WHEN +(.*?) +THEN +((?:(?! +(?:WHEN|ELSE)).)*)
    
    import re
    
    cases = [
        'CASE WHEN  "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"=\'CPU\'  THEN   \'YES\'  WHEN  "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"=\'RAM\'  THEN   \'YES\' ELSE  \'NO\' END',
        'CASE WHEN  ("Expressions"."ORDER_ITEMS"."QUANTITY"*"Expressions"."ORDER_ITEMS"."UNIT_PRICE")>0  THEN  ("Expressions"."ORDER_ITEMS"."QUANTITY"* "Expressions"."ORDER_ITEMS"."UNIT_PRICE") ELSE ("Expressions"."ORDER_ITEMS"."QUANTITY"+ "Expressions"."ORDER_ITEMS"."UNIT_PRICE")   END '
    ]
    
    for case in cases:
        results = re.findall(r'WHEN +(.*?) +THEN +((?:(?! +(?:WHEN|ELSE)).)*)', case, flags=re.I)
        for result in results:
            print(result[0], result[1])
    

    Prints:

    "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"='CPU' 'YES'
    "Expressions"."PRODUCT_CATEGORIES"."CATEGORY_NAME"='RAM' 'YES'
    ("Expressions"."ORDER_ITEMS"."QUANTITY"*"Expressions"."ORDER_ITEMS"."UNIT_PRICE")>0 ("Expressions"."ORDER_ITEMS"."QUANTITY"* "Expressions"."ORDER_ITEMS"."UNIT_PRICE")