regex, pcre

Regex: Match until "," but not if "," is inside brackets


Given input lines like the following:

group1: hello, group2: world
group1: hello (hello, world) world, group2: world
group1: hello world

of the style <group_name>: <group_value>[, <group_name>: <group_value>[...]].

In general I use the following regex to extract the values:

group1:\s(?P<group1>[^,\n]+)(?:,\sgroup2:\s(?P<group2>[^,\n]+))?\n

which works fine unless a , appears inside the group_value.
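To make the failure concrete, here is a minimal sketch using Python's `re` module, which accepts the same `(?P<name>...)` syntax. The helper name `extract` is an assumption made for illustration, and the optional part is written as a non-capturing group `(?:...)`.

```python
import re

# The negated-character-class pattern from the question; the optional
# group2 part is wrapped in a non-capturing group.
NAIVE = re.compile(
    r"group1:\s(?P<group1>[^,\n]+)(?:,\sgroup2:\s(?P<group2>[^,\n]+))?\n"
)


def extract(pattern, line):
    """Return (group1, group2) for one newline-terminated line, or None."""
    m = pattern.search(line)
    return (m.group("group1"), m.group("group2")) if m else None


simple = extract(NAIVE, "group1: hello, group2: world\n")
single = extract(NAIVE, "group1: hello world\n")
# The comma inside the brackets stops [^,\n]+ too early, so nothing matches.
nested = extract(NAIVE, "group1: hello (hello, world) world, group2: world\n")

print(simple)  # ('hello', 'world')
print(single)  # ('hello world', None)
print(nested)  # None
```

The third line fails outright rather than returning a truncated value, because the trailing `\n` can never be reached once the character class stops at the bracketed comma.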

I know that this toy example can be solved by something like:

group1:\s(?P<group1>.+?)(?:,\sgroup2:\s(?P<group2>.+?))?\n

However, I do want to protect myself against accidentally matching everything, so I would still like the match to stop when it encounters a ,.
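The trade-off can be seen in a short Python `re` sketch. The helper `extract` and the malformed input with a `group3` key are assumptions made up for illustration; they do not come from the original data.

```python
import re

# The lazy variant from the question.
LAZY = re.compile(r"group1:\s(?P<group1>.+?)(?:,\sgroup2:\s(?P<group2>.+?))?\n")


def extract(pattern, line):
    """Return (group1, group2) for one newline-terminated line, or None."""
    m = pattern.search(line)
    return (m.group("group1"), m.group("group2")) if m else None


# The bracketed comma is now handled as intended.
nested = extract(LAZY, "group1: hello (hello, world) world, group2: world\n")

# Hypothetical malformed line: the unexpected key is swallowed into group1
# because .+? is free to run across commas until the newline.
leaky = extract(LAZY, "group1: a, group3: b\n")

print(nested)  # ('hello (hello, world) world', 'world')
print(leaky)   # ('a, group3: b', None)
```

The second result is the kind of accidental over-matching the question wants to rule out: nothing stops the lazy quantifier at a top-level comma.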

Question: Is there a (general) way to match up to , and for that purpose ignore ,s that are in brackets?


Solution

  • Using PCRE, you could use a recursive pattern for balanced parentheses, combined with possessive quantifiers.

    You define the pattern for group 1 and, if the same logic applies to group 2, recurse into the subpattern defined in group 1.

    As you exclude matching a newline in the negated character class, you might use \h to match horizontal whitespace characters instead of \s.

    \bgroup1:\h+(?P<group1>(?:[^,\n()]*(?:(\((?:[^()\n]+|(?2))*+\)))?)*+)(?:,\h+group2:\h+(?P<group2>\g<group1>))?\R
    
    • \bgroup1:\h+ Match the word group1 and then : and 1+ horizontal whitespace chars
    • (?P<group1> Named group1
      • (?: Non capture group
        • [^,\n()]* Match optional chars other than , newline ( or )
        • (?: Non capture group
          • (\((?:[^()\n]+|(?2))*+\)) Match balanced parentheses, recursing into capture group 2 (the numbered group opened here, not the named group2)
        • )? Close group and make it optional
      • )*+ Close the group and optionally repeat with a possessive quantifier (no backtracking)
    • ) Close group1
    • (?: Non capture group
      • ,\h+group2:\h+ Match , then 1+ horizontal whitespace chars, then group2: and 1+ horizontal whitespace chars
      • (?P<group2>\g<group1>) Named group2, recurse the subpattern in named group1
    • )? Close the non capture group and make it optional
    • \R Match any newline sequence
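For engines without recursion (Python's built-in `re` is one), the same idea of ignoring commas inside balanced brackets can be sketched without a regex by tracking nesting depth. This is a swapped-in technique, not the PCRE pattern above, and the function names `split_top_level` and `parse_line` are assumptions chosen for illustration.

```python
def split_top_level(text, sep=",", opener="(", closer=")"):
    """Split text on sep, skipping separators nested inside opener/closer pairs."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == opener:
            depth += 1
        elif ch == closer and depth > 0:
            # A stray closer at depth 0 is treated as ordinary text.
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def parse_line(line):
    """Parse '<name>: <value>[, <name>: <value>...]' into a dict of stripped strings."""
    fields = {}
    for part in split_top_level(line.rstrip("\n")):
        # Only the first colon separates name from value.
        name, _, value = part.partition(":")
        fields[name.strip()] = value.strip()
    return fields


parsed = parse_line("group1: hello (a, (b, c)) world, group2: world\n")
plain = parse_line("group1: hello world\n")

print(parsed)  # {'group1': 'hello (a, (b, c)) world', 'group2': 'world'}
print(plain)   # {'group1': 'hello world'}
```

Like the recursive pattern, this handles arbitrarily deep nesting. Unlike it, the sketch does not reject unbalanced input: an unclosed bracket simply hides every later comma on that line.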
