Extract the input parameters from the string using regex into parameter and value group


I need help extracting/splitting these parameters within the string using regex into parameter and value groups.

Input -> collection_a=['U1', 'U2'], collection_b=['U1', 'U2']

output -> Group Parameter = collection_a
          Group Value = ['U1', 'U2']

          Group Parameter = collection_b
          Group Value = ['U1', 'U2']

Input -> collection=['U1', 'U2'], callback_macro=utils.user_email(user_id=$$)

output -> Group Parameter = collection
          Group Value = ['U1', 'U2']

          Group Parameter = callback_macro
          Group Value = utils.user_email(user_id=$$)

Input -> collection=['U1', 'U2'], callback_macro=utils.user_email(user_id=$$, config={'user': 'ADMIN'})

output -> Group Parameter = collection
          Group Value = ['U1', 'U2']

          Group Parameter = callback_macro
          Group Value = utils.user_email(user_id=$$)

Input -> collection=['U1','U2'], callback_macro=string.replace(value=$$, pattern=^(.*)$, replacement={'user': $1})

output -> Group Parameter = collection
          Group Value = ['U1', 'U2']

          Group Parameter = callback_macro
          Group Value = string.replace(value=$$, pattern=^(.*)$, replacement={'user': $1})

I'm using the regex /((\s*(?<parameter>[a-z_]+)\s*=\s*(?<value>((?!(,\s*[a-z_]+)\s*=\s*).)*)),{1,})/g. It works perfectly for cases 1 and 2 but breaks on cases 3 and 4, because in those cases the argument's value itself contains an =.

Regex link - https://regex101.com/r/U2CaLb/1


Solution

  • It would be very difficult to do a split with these requirements.
    Instead, you could match each key/value pair and put the matches into an array.
    Note that the third input sample does not follow the pattern of the others.

    Here are some options to choose from.

    Method 1 This regex matches a single level of brackets as part of the value.

    (\w+)=((?:\[.*?\]|\(.*?\)|{.*?}|[^,\r\n])*)
    

    https://regex101.com/r/fnPb6e/1

     ( \w+ )                       # (1)
     =
     (                             # (2 start)
        (?:
           \[ .*? \] 
         | \( .*? \) 
         | { .*? } 
         | 
           [^,\r\n] 
        )*
     )                             # (2 end)
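
    As a rough illustration, Method 1 can be tried with Python's built-in re module, which handles this pattern because it needs no recursion. The names PAIR, extract, case3 and case4 are made up for this sketch, and the curly braces are escaped here only for clarity. Because the bracket matching is single-level and lazy, a value with nested parentheses (as in the fourth sample) is cut at the first closing parenthesis.

```python
import re

# Method 1 pattern: a value is a run of single-level [...], (...), {...}
# groups or any character other than a comma or line break.
PAIR = re.compile(r"(\w+)=((?:\[.*?\]|\(.*?\)|\{.*?\}|[^,\r\n])*)")


def extract(text):
    """Return a list of (parameter, value) tuples found in text."""
    return PAIR.findall(text)


case3 = ("collection=['U1', 'U2'], "
         "callback_macro=utils.user_email(user_id=$$, config={'user': 'ADMIN'})")
case4 = ("collection=['U1','U2'], "
         "callback_macro=string.replace(value=$$, pattern=^(.*)$, "
         "replacement={'user': $1})")

for parameter, value in extract(case3):
    print(parameter, "->", value)

# The nested "(.*)" in case 4 ends the lazy \(.*?\) early, so the comma
# after "$" is treated as a separator and "replacement" becomes its own pair.
for parameter, value in extract(case4):
    print(parameter, "->", value)
```

    If the values can contain nested brackets, one of the balanced methods is the safer choice.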
    

    For balanced bracket text, on PCRE or other non-.NET engines that support recursion:

    Method 2 This regex is the simple version: each of (), [], {} may nest,
    and each kind is balanced independently, recursing only into its own kind.

    (\w+)=((?:(\[(?:[^\[\]]++|(?3))*\])|(\((?:[^()]++|(?4))*\))|({(?:[^{}]++|(?5))*})|[^,\r\n])*)
    

    https://regex101.com/r/dgWMDZ/1

     ( \w+ )                       # (1)
     =
     (                             # (2 start)
        (?:
           (                             # (3 start)
              \[
              (?:
                 [^\[\]]++ 
               | (?3)
              )*
              \]
           )                             # (3 end)
         | (                             # (4 start)
              \(
              (?:
                 [^()]++ 
               | (?4)
              )*
              \)
           )                             # (4 end)
         | (                             # (5 start)
              {
              (?:
                 [^{}]++ 
               | (?5)
              )*
              }
           )                             # (5 end)
         | [^,\r\n]
        )*
     )                             # (2 end)
    

    Method 3 This regex balances the different bracket kinds when they are nested
    inside each other. It allows for more internal bracket structure, which is
    probably not needed for this scenario but could be in the future.
    It is an enhanced version of Method 2 and covers everything Method 2 does.

    (\w+)=((?:(\[(?:[^\[\](){}]++|(?3))*\]|\((?:[^\[\](){}]++|(?3))*\)|{(?:[^\[\](){}]++|(?3))*})|[^,\r\n])*)
    

    https://regex101.com/r/ZubSke/1

     ( \w+ )                       # (1)
     =
     (                             # (2 start)
        (?:
           (                             # (3 start)
              \[ 
              (?:
                 [^\[\](){}]++ 
               | (?3) 
              )*
              \] 
            | 
              \( 
              (?:
                 [^\[\](){}]++ 
               | (?3) 
              )*
              \) 
            | 
              {
              (?:
                 [^\[\](){}]++ 
               | (?3) 
              )*
              }
           )                             # (3 end)
         | 
           [^,\r\n] 
        )*
     )                             # (2 end)
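
    Python's built-in re module has no (?3)-style recursion, so the pattern above cannot be used there as written. As a stand-in, the sketch below swaps the regex for a small hand-written scanner that tracks mixed nesting with a stack and splits only on commas at depth zero. It is not the regex itself: the name split_top_level is hypothetical, and unbalanced or mismatched closers are simply ignored, which is an assumption of this sketch.

```python
def split_top_level(text):
    """Split 'k=v, k=v' on commas that sit outside any (), [] or {} pair."""
    closers = {"(": ")", "[": "]", "{": "}"}
    stack = []              # expected closing brackets, innermost last
    chunks, start = [], 0
    for i, ch in enumerate(text):
        if ch in closers:
            stack.append(closers[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
        elif ch == "," and not stack:
            # A comma at depth zero separates two parameters.
            chunks.append(text[start:i])
            start = i + 1
    chunks.append(text[start:])

    pairs = []
    for chunk in chunks:
        # Only the first '=' separates the key; later ones belong to the value.
        key, sep, value = chunk.strip().partition("=")
        if sep:
            pairs.append((key.strip(), value.strip()))
    return pairs


case4 = ("collection=['U1','U2'], "
         "callback_macro=string.replace(value=$$, pattern=^(.*)$, "
         "replacement={'user': $1})")

for key, value in split_top_level(case4):
    print(key, "->", value)
```

    With nesting tracked this way, the nested "(.*)" in the fourth sample no longer ends the callback_macro value early.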
    

    Method 4 Same as Method 3, with added handling of simple single- or double-quoted strings.
    This regex blends quote parsing into any other delimiter pair
    balance, as well as outside of these delimiters.
    Note that there is a pass-through catch-all, [^,\r\n], to catch unbalanced
    delimiters. This is by design, as balanced text is really just a suggestion during the fleshing-out process.

    (\w+)=((?:(\[(?:[^\[\](){}'"]++|(?4)|(?3))*\]|\((?:[^\[\](){}'"]++|(?4)|(?3))*\)|{(?:[^\[\](){}'"]++|(?4)|(?3))*})|('[^'\r\n]*?'|"[^"\r\n]*?")|[^,\r\n])*)
    

    https://regex101.com/r/090iI7/1

     ( \w+ )                       # (1)
     =
     (                             # (2 start)
        (?:
           (                             # (3 start)
              \[ 
              (?:
                 [^\[\](){}'"]++ 
               | (?4) 
               | (?3) 
              )*
              \] 
            | 
              \( 
              (?:
                 [^\[\](){}'"]++ 
               | (?4) 
               | (?3) 
              )*
              \) 
            | 
              {
              (?:
                 [^\[\](){}'"]++ 
               | (?4) 
               | (?3) 
              )*
              }
           )                             # (3 end)
         | (                             # (4 start)
              ' [^'\r\n]*? '
            | 
              " [^"\r\n]*? "
           )                             # (4 end)
         | 
           [^,\r\n] 
           
        )*
     )                             # (2 end)
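
    For engines without recursion, the quote handling of Method 4 can be approximated outside of regex. The sketch below is a hand-written scanner rather than the pattern above: it remembers whether it is inside a quoted string and ignores brackets and commas there. The name split_params and the sample string are invented for illustration. Unlike the regex, it does not stop a quoted string at a line break and it does not handle escaped quotes.

```python
def split_params(text):
    """Split on top-level commas, ignoring brackets and commas inside quotes."""
    closers = {"(": ")", "[": "]", "{": "}"}
    stack = []      # expected closing brackets, innermost last
    quote = None    # the quote character we are inside, if any
    chunks, start = [], 0
    for i, ch in enumerate(text):
        if quote:
            if ch == quote:         # closing quote ends the string
                quote = None
        elif ch in "'\"":
            quote = ch              # opening quote starts a string
        elif ch in closers:
            stack.append(closers[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
        elif ch == "," and not stack:
            chunks.append(text[start:i])
            start = i + 1
    chunks.append(text[start:])
    # Split each chunk at its first '=' into (parameter, value).
    return [tuple(part.strip() for part in chunk.split("=", 1))
            for chunk in chunks if "=" in chunk]


sample = "label='a, (b', collection=['U1', 'U2']"
for key, value in split_params(sample):
    print(key, "->", value)
```

    Here the comma and the unmatched "(" inside the quoted label value are treated as plain text, so the string still splits into two parameters.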