
Regex help needed to parse and extract property from an expression tree


Here is a valid property tree expression (it can be recursive):

rootProperty:(prop1, prop2, subProp1:(prop1,subSubProp1:(prop1,prop2,etc),prop3), prop3, etc)

So in effect a property can have many properties and sub-properties. From this expression I would like to capture the following:

  • rootProperty
  • prop1
  • prop2
  • subProp1:(prop1,subSubProp1:(prop1,prop2,etc),prop3)
  • prop3

I tried a few approaches but couldn't get the repetitions to work recursively, hence this request for help.

Thanks, Kannan


Solution

  • This is not a regular language because of the recursion (balanced parentheses), so a regular expression might not be the right tool. But assuming you know what you are doing, and your engine supports recursion with (?R) (PCRE does):

    ([^:(), ]+)(?::\(((?R)?(?:, ?(?R))*)\))?
    

    First we capture the name of the property: one or more characters that are not a colon, a parenthesis, a comma, or a space.

    ([^:(), ]+)
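
    As a quick check of this character class on its own, here is a small sketch using Python's standard `re` module (chosen only for illustration; the sample strings and the name `NAME` are made up). It shows that the class stops at the first delimiter, and that on its own it finds every name but loses the nesting:

```python
import re

# The name part of the pattern: one or more characters other than : ( ) , or space.
NAME = re.compile(r"([^:(), ]+)")

# Anchored at the start of the input, the match stops at the first ':'.
first = NAME.match("rootProperty:(prop1, prop2)").group(1)

# Applied repeatedly, it returns every name in order, with no tree structure.
names = NAME.findall("a:(b, c:(d,e))")

print(first)
print(names)
```

    This is why the rest of the pattern is needed: the name class alone cannot tell which names belong to which subtree.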
    

    A property may or may not have a subtree, so the next part is the optional subtree:

    (?:           <--- do not capture
       :          <--- literal ':'
       \(         <--- literal '('
          ...     <--- some stuff inside
       \)         <--- literal ')'
    )?            <--- it is optional
    

    The stuff inside captures a list of properties:

    (             <--- do capture
     (?R)         <--- recursively match a property
     (?:          <--- do not capture
        , ?       <--- comma followed by optional space
        (?R)      <--- recursively match another property
     )*           <--- any number of comma separated properties
    )             <--- end capture
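
    The separator logic can be tried in isolation. The sketch below is a flat, non-recursive variant of this list: each (?R) is replaced by the plain name class, because Python's standard `re` module (used here only for illustration) has no (?R). `NAME`, `FLAT_LIST` and `is_flat_list` are hypothetical names.

```python
import re

NAME = r"[^:(), ]+"

# Same shape as (?R)?(?:, ?(?R))* but with each (?R) replaced by NAME,
# so it only accepts lists of plain names without subtrees.
FLAT_LIST = re.compile(rf"(?:{NAME})?(?:, ?{NAME})*")


def is_flat_list(text):
    """Return True if the whole text is a comma-separated list of plain names."""
    return FLAT_LIST.fullmatch(text) is not None


print(is_flat_list("prop1, prop2,etc"))
print(is_flat_list("prop1,  prop2"))
```

    The second call is rejected because `, ?` allows at most one space after a comma. An empty string is accepted, since the first item is optional and the repetition may match zero times.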
    

    For your example input:

    Input:
        rootProperty:(prop1, prop2, subProp1:(prop1,subSubProp1:(prop1,prop2,etc),prop3), prop3, etc)
    Match 1:
        rootProperty:(prop1, prop2, subProp1:(prop1,subSubProp1:(prop1,prop2,etc),prop3), prop3, etc)
        Group 1:
            rootProperty
        Group 2:
            prop1, prop2, subProp1:(prop1,subSubProp1:(prop1,prop2,etc),prop3), prop3, etc
    

    You could then apply the same pattern to the second group of each match to capture the properties of its subtree. There may be a way to get this information out of the recursive match itself so that the extra passes are not needed, but I don't know how.

    Input:
        prop1, prop2, subProp1:(prop1,subSubProp1:(prop1,prop2,etc),prop3), prop3, etc
    Match 1:
        prop1
    Match 2:
        prop2
    Match 3:
        subProp1:(prop1,subSubProp1:(prop1,prop2,etc),prop3)
        Group 1:
            subProp1
        Group 2:
            prop1,subSubProp1:(prop1,prop2,etc),prop3
    Match 4:
        prop3
    Match 5:
        etc
    

    Then,

    Input:
        prop1,subSubProp1:(prop1,prop2,etc),prop3
    Match 1:
        prop1
    Match 2:
        subSubProp1:(prop1,prop2,etc)
        Group 1:
            subSubProp1
        Group 2:
            prop1,prop2,etc
    Match 3:
        prop3
    

    And finally:

    Input:
        prop1,prop2,etc
    Match 1:
        prop1
    Match 2:
        prop2
    Match 3:
        etc
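
    The repeated passes above can also be done in one go with a small hand-written recursive-descent parser instead of a regex, in line with the caveat at the start of this answer. A minimal Python sketch, assuming roughly the same grammar as the pattern (names made of anything except a colon, parenthesis, comma or space; children separated by a comma and at most one space). `Prop` and `parse_prop` are hypothetical names, not part of the original answer:

```python
from dataclasses import dataclass, field


@dataclass
class Prop:
    name: str
    children: list = field(default_factory=list)


# The characters that the name class [^:(), ] excludes.
STOP = set(":(), ")


def parse_prop(text, pos=0):
    """Parse one property starting at pos; return (Prop, position after it)."""
    start = pos
    while pos < len(text) and text[pos] not in STOP:
        pos += 1
    if pos == start:
        raise ValueError(f"expected a property name at index {start}")
    node = Prop(text[start:pos])

    if text.startswith(":(", pos):             # optional subtree
        pos += 2
        if not text.startswith(")", pos):      # an empty list is allowed
            while True:
                child, pos = parse_prop(text, pos)
                node.children.append(child)
                if not text.startswith(",", pos):
                    break
                pos += 1                       # skip the comma
                if text.startswith(" ", pos):  # and one optional space
                    pos += 1
        if not text.startswith(")", pos):
            raise ValueError(f"expected ')' at index {pos}")
        pos += 1
    return node, pos


text = ("rootProperty:(prop1, prop2, "
        "subProp1:(prop1,subSubProp1:(prop1,prop2,etc),prop3), prop3, etc)")
root, end = parse_prop(text)
print(root.name, [child.name for child in root.children])
```

    Each `Prop` keeps its children as a list, so the whole tree is available after a single pass. Unlike the regex, this sketch rejects a list that starts with a comma.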
    

    https://regex101.com/r/WAXrFd/2