Given the following pattern:
group1: hello, group2: world
group1: hello (hello, world) world, group2: world
group1: hello world
of the style <group_name>: <group_value>[, <group_name>: <group_value>[...]].
In general I use the following regex to extract the values:
group1:\s(?P<group1>[^,\n]+)(?:,\sgroup2:\s(?P<group2>[^,\n]+))?\n
which works fine unless a comma (,) exists inside the group_value.
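To make the failure concrete, here is a small sketch that runs the pattern above against the three sample lines. It assumes Python's built-in re module purely for illustration (the question does not name an engine), and the extract helper and SAMPLE constant are hypothetical names.

```python
import re

# The pattern from the question, with the optional part written as a
# non-capturing group (?:...).
PATTERN = re.compile(
    r"group1:\s(?P<group1>[^,\n]+)(?:,\sgroup2:\s(?P<group2>[^,\n]+))?\n"
)

# The three sample lines from the question.
SAMPLE = (
    "group1: hello, group2: world\n"
    "group1: hello (hello, world) world, group2: world\n"
    "group1: hello world\n"
)


def extract(text):
    """Return a (group1, group2) tuple for every match found in text."""
    return [(m.group("group1"), m.group("group2")) for m in PATTERN.finditer(text)]


for pair in extract(SAMPLE):
    print(pair)
# ('hello', 'world')
# ('hello world', None)
```

Only two of the three lines produce a match. The second line is skipped entirely, because group1 stops at the comma inside the parentheses and the rest of the pattern cannot match from there.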
I know that this toy example can be solved by something like:
group1:\s(?P<group1>.+?)(?:,\sgroup2:\s(?P<group2>.+?))?\n
However, I do want to protect myself against accidentally matching everything, so I would still like to limit my match when it encounters a comma.
Question: Is there a (general) way to match up to a comma and, for that purpose, ignore commas that are inside brackets?
Using PCRE, you could make use of a recursive pattern for balanced parentheses with possessive quantifiers.
You define the pattern for group 1, and if the same logic applies for group 2, you can recurse the subpattern defined in group 1.
As you exclude matching a newline in the negated character class, you might use \h to match horizontal whitespace characters instead of using \s.
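A quick sketch of why this matters: \s also matches a newline, so a separator written with \s can run on into the next line. Python's built-in re module is assumed here only for demonstration. It has no \h, so the class [ \t] is used as a rough stand-in (PCRE's \h also covers other Unicode horizontal spaces).

```python
import re

# A key with nothing after it on the same line, followed by another line.
TEXT = "group1:\nhello"

# \s+ happily consumes the newline, so the value is taken from the next line.
loose = re.search(r"group1:\s+(\w+)", TEXT)

# [ \t]+ (an approximation of PCRE's \h+) cannot cross the line break.
strict = re.search(r"group1:[ \t]+(\w+)", TEXT)

print(loose.group(1) if loose else None)  # hello
print(strict)                             # None
```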
\bgroup1:\h+(?P<group1>(?:[^,\n()]*(?:(\((?:[^()\n]+|(?2))*+\)))?)*+)(?:,\h+group2:\h+(?P<group2>\g<group1>))?\R
The pattern in parts:
- \bgroup1:\h+  Match the word group1, then : and 1+ horizontal whitespace chars
- (?P<group1>  Named group group1
  - (?:  Non capture group
    - [^,\n()]*  Match optional chars other than a comma, a newline, ( or )
    - (?:  Non capture group
      - (\((?:[^()\n]+|(?2))*+\))  Match balanced parentheses, recursing group 2
    - )?  Close the group and make it optional
  - )*+  Close the group and optionally repeat it with a possessive quantifier (no backtracking)
- )  Close group1
- (?:  Non capture group
  - ,\h+group2:\h+  Match a comma, then group2: with 1+ horizontal whitespace chars on either side
  - (?P<group2>\g<group1>)  Named group group2, recursing the subpattern in named group1
- )?  Close the non capture group and make it optional
- \R  Match a newline sequence
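If your engine has no recursion (Python's built-in re, for example), the same rule can be expressed without a regex by tracking the bracket depth. This is not the PCRE approach above, just a minimal sketch of the same idea: a comma only counts as a separator when the depth is 0. The function names split_top_level and parse_line are made up for this example.

```python
def split_top_level(line, sep=",", opener="(", closer=")"):
    """Split line on sep, ignoring separators inside balanced brackets."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(line):
        if ch == opener:
            depth += 1
        elif ch == closer:
            if depth == 0:
                raise ValueError(f"unbalanced {closer!r} at index {i}")
            depth -= 1
        elif ch == sep and depth == 0:
            # Only a separator outside all brackets ends the current part.
            parts.append(line[start:i])
            start = i + 1
    if depth != 0:
        raise ValueError(f"unclosed {opener!r}")
    parts.append(line[start:])
    return parts


def parse_line(line):
    """Turn 'name: value, name: value' into a dict of stripped strings."""
    fields = {}
    for part in split_top_level(line):
        # Split on the first colon only, so values may contain colons.
        name, _, value = part.partition(":")
        fields[name.strip()] = value.strip()
    return fields


print(parse_line("group1: hello (hello, world) world, group2: world"))
# {'group1': 'hello (hello, world) world', 'group2': 'world'}
```

Unlike the regex, this raises on unbalanced brackets instead of silently not matching, which may or may not be what you want for your input.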