regex optimization pcre

Repeat capturing group of comma-separated key-value pairs


I'm currently trying to extract the following from patterns like @Apple(kind="Bax", priority=33)

  • Apple
  • [kind, Bax], [priority, 33]

What I currently use is @([^(]*)\(([^\)]*)\). Then I have Apple and kind="Bax", priority=33. After this, I split group 2 on ,, then split on = and finally remove " if any at start or end.
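
For reference, the approach described above could look roughly like this in Python. This is only a sketch: the question does not name a host language, so Python, the function name parse_by_splitting, and the whitespace trimming around keys and values are all assumptions made for illustration.

```python
import re

# Pattern from the question: group 1 is the name, group 2 the raw argument list.
ANNOTATION = re.compile(r'@([^(]*)\(([^\)]*)\)')

def parse_by_splitting(text):
    """Hypothetical helper: regex capture first, then split on ',' and '='."""
    match = ANNOTATION.search(text)
    if match is None:
        return None
    name, raw_args = match.group(1), match.group(2)
    pairs = []
    for item in raw_args.split(','):          # second pass over the arguments
        if not item.strip():
            continue                          # skip empty pieces, e.g. "@Foo()"
        key, _, value = item.partition('=')   # third pass, once per item
        # Trimming whitespace is an assumption; stripping quotes follows the question.
        pairs.append([key.strip(), value.strip().strip('"')])
    return name, pairs

print(parse_by_splitting('@Apple(kind="Bax", priority=33)'))
# prints ('Apple', [['kind', 'Bax'], ['priority', '33']])
```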

This traverses the second segment several times: once for the regex capture, again to find every , and then once more per piece to find the =, and so on.

Since I do this millions of times, is there any way to capture everything within the regex traversal itself? I'd like to avoid all the splits.


Solution

  • Assuming you want to allow an arbitrary number of key=value pairs, how about:

    (?:@|\(|,\s*|="?)(\w+)(?=\(|=|"|,|\))
    


    All captures are in Group 1.

    • (?:@|\(|,\s*|="?) matches one of @, (, a comma followed by zero or more whitespace characters, or = followed by an optional ".
    • (\w+) matches the desired word and captures it in Group 1.
    • (?=\(|=|"|,|\)) is a positive lookahead assertion requiring the next character to be one of (, =, ", , or ).
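
As a rough illustration, the single regex above could be applied like this in Python. The host language, the function name parse_single_pass, and the pairing of tokens into key-value lists are assumptions added here, not part of the original answer.

```python
import re

# The regex from the answer; every match puts one word into group 1.
TOKEN = re.compile(r'(?:@|\(|,\s*|="?)(\w+)(?=\(|=|"|,|\))')

def parse_single_pass(text):
    """Hypothetical helper: one scan collects the name, keys, and values."""
    tokens = TOKEN.findall(text)   # with a single group, findall returns group 1
    if not tokens:
        return None
    name, rest = tokens[0], tokens[1:]
    # Remaining tokens alternate key, value, key, value, ...
    pairs = [[key, value] for key, value in zip(rest[0::2], rest[1::2])]
    return name, pairs

print(parse_single_pass('@Apple(kind="Bax", priority=33)'))
# prints ('Apple', [['kind', 'Bax'], ['priority', '33']])
```

Because the capture is \w+, this sketch only works when every key and value consists of word characters. A value such as "John Doe" is not matched at all, which would shift the key-value pairing.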

    [Edit]

    If an element enclosed in double quotes may contain a comma, it will not be easy to parse with a single regex. Even if it is possible, the result will be less maintainable. I would split the operation into two steps. Suppose we have the string:

    @Apple(val="a,b", kind="Bax", priority=33,foo=bar, name="John Doe", lorem=ipsum)
    

    Then, with the first regex:

    ^@([^(]+)\(([^)]+)\)
    

    Apple is captured in Group 1 and the substring enclosed in the parentheses is captured in Group 2.

    Then apply the second regex to Group 2:

    (?<=")[^"=]+(?=")|[^,=" ]+
    

    Now we can obtain the list:

    ['val', 'a,b', 'kind', 'Bax', 'priority', '33', 'foo', 'bar', 'name', 'John Doe', 'lorem', 'ipsum']
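
The two steps can be sketched in Python as follows. The names OUTER, INNER, and parse_two_step are hypothetical, and grouping the flat token list into (key, value) tuples is an extra step not shown in the answer.

```python
import re

# Step 1: name in group 1, everything between the parentheses in group 2.
OUTER = re.compile(r'^@([^(]+)\(([^)]+)\)')
# Step 2: either a quoted value (without its quotes) or a bare key/value token.
INNER = re.compile(r'(?<=")[^"=]+(?=")|[^,=" ]+')

def parse_two_step(text):
    """Hypothetical helper combining both regexes from the answer."""
    outer = OUTER.match(text)
    if outer is None:
        return None
    name, raw_args = outer.groups()
    tokens = INNER.findall(raw_args)
    # Tokens alternate key, value; pair them up.
    return name, list(zip(tokens[0::2], tokens[1::2]))

sample = '@Apple(val="a,b", kind="Bax", priority=33,foo=bar, name="John Doe", lorem=ipsum)'
name, pairs = parse_two_step(sample)
print(name)
print(pairs)
```

Note that [^"=]+ excludes =, so a quoted value that itself contains = would not be captured as one token by this pattern.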
    
