c# regex balancing-groups expresso

How do I capture with balancing groups?


Let's say I have this text input.

 tes{}tR{R{abc}aD{mnoR{xyz}}}

I want to extract the following output:

 R{abc}
 R{xyz}
 D{mnoR{xyz}}
 R{R{abc}aD{mnoR{xyz}}}

Currently, I can only extract what's inside the {} groups, using the balancing-group approach described on MSDN. Here's the pattern:

 ^[^{}]*(((?'Open'{)[^{}]*)+((?'Target-Open'})[^{}]*)+)*(?(Open)(?!))$

Does anyone know how to include the R{} and D{} in the output?


Solution

  • I think a different approach is required here. Once you match the first, larger group R{R{abc}aD{mnoR{xyz}}} (see my comment about the possible typo), you can't get at the subgroups inside it: the match has already consumed them, so the regex can't return the individual R{ ... } groups as separate matches.

    So there had to be a way to capture without consuming, and the obvious way to do that is a positive lookahead. Inside the lookahead you can put the expression you used, with some changes to fit the new focus. I came up with:

    (?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!))))
    

    [I also renamed 'Open' to 'O' and removed the named capture for the closing brace, to make the pattern shorter and avoid noise in the matches.]

    On regexhero.net (the only free .NET regex tester I know so far), I got the following capture groups:

    1: R{R{abc}aD{mnoR{xyz}}}
    1: R{abc}
    1: D{mnoR{xyz}}
    1: R{xyz}
    

    Breakdown of regex:

    (?=                         # Opening positive lookahead
        ([A-Z]                  # Opening capture group and any uppercase letter (to match R & D)
            (?:                 # First non-capture group opening
                (?:             # Second non-capture group opening
                    (?'O'{)     # Push an opening brace onto the 'O' stack
                    [^{}]*      # Any non-brace characters
                )+              # Close of second non-capture group and repeat over as many times as necessary
                (?:             # Third non-capture group opening
                    (?'-O'})    # Pop one entry off the 'O' stack for each closing brace
                    [^{}]*?     # Lazily match non-brace characters after a closing brace, so trailing text is not pulled into the match
                )+              # Close of third non-capture group and repeat over as many times as necessary
            )+                  # Close of first non-capture group and repeat as many times as necessary for multiple side by side nested braces
            (?(O)(?!))          # Condition to prevent unbalanced braces
        )                       # Close capture group
    )                           # Close positive lookahead
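
    As a rough illustration of the pattern above, here is a minimal C# sketch that runs it with Regex.Matches. Because the whole pattern sits inside a lookahead, every match is zero-width and the text of interest is in capture group 1. The class name Program and the variable names are only illustrative; the input is the string from the question.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            // Sample input taken from the question.
            const string input = "tes{}tR{R{abc}aD{mnoR{xyz}}}";

            // Lookahead pattern from this answer. A verbatim string (@"...") keeps
            // the braces and quotes literal, so no extra escaping is needed.
            const string pattern =
                @"(?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!))))";

            // Each match is empty (the lookahead consumes nothing), so the engine
            // moves on one character at a time and can find overlapping groups.
            foreach (Match m in Regex.Matches(input, pattern))
            {
                // Group 1 is the unnamed capture inside the lookahead.
                Console.WriteLine(m.Groups[1].Value);
            }
        }
    }
    ```

    On the question's input this should print the four groups listed above, in the order they start in the string. The {} after "tes" is skipped because the pattern requires an uppercase letter directly before the opening brace.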
    

    The following will not work in C#

    I also wanted to see how this would work on the PCRE engine, which supports recursive patterns. I found it easier, since I'm more familiar with PCRE, and it yields a shorter regex :)

    (?=([A-Z]{(?:[^{}]|(?1))+}))
    

    regex101 demo

    (?=                    # Opening positive lookahead
        ([A-Z]             # Opening capture group and any uppercase letter (to match R & D)
            {              # Opening brace
                (?:        # Opening non-capture group
                    [^{}]  # Matches non braces
                |          # OR
                    (?1)   # Recurse first capture group
                )+         # Close non-capture group and repeat as many times as necessary
            }              # Closing brace
        )                  # Close of capture group
    )                      # Close of positive lookahead