c# regex regex-group

Capture outer paren/bracket groups while ignoring inner paren groups


This is a variation on my previous SO question. The answer there worked perfectly for me until I ran into an edge case that caused a problem, so I now need a tweaked regex pattern. I have tried to work it out on my own at Regex Storm, but my knowledge of regex isn't quite advanced enough for this.

The one change from my previous post (linked above) is that I am now only interested in matching paren groupings that begin with ([ instead of merely (. The end of the grouping remains the same: )

For the sake of completeness, here is the entire previous question, modified for the new requirement:

I'm using C# and regex, trying to capture outer paren groups while ignoring inner paren groups. I have legacy-generated text files containing thousands of string constructions like the following:

([txtData] of COMPOSITE
(dirty FALSE)
(composite [txtModel])
(view [star3])
(creationIndex 0)
(creationProps )
(instanceNameSpecified FALSE)
(containsObject nil)
(sName txtData)
(txtDynamic FALSE)
(txtSubComposites )
(txtSubObjects )
(txtSubConnections )
)

([txtUI] of COMPOSITE
(dirty FALSE)
(composite [txtModel])
(view [star2])
(creationIndex 0)
(creationProps )
(instanceNameSpecified FALSE)
(containsObject nil)
(sName ApplicationWindow)
(txtDynamic FALSE)
(txtSubComposites )
(txtSubObjects )
(txtSubConnections )
)

([star38] of COMPOSITE
(dirty FALSE)
(composite [txtUI])
(view [star39])
(creationIndex 26)
(creationProps composite [txtUI] sName Bestellblatt)
(instanceNameSpecified TRUE)
(containsObject COMPOSITE)
(sName Bestellblatt)
(txtDynamic FALSE)
(txtSubComposites )
(txtSubObjects )
(txtSubConnections )
)

I am looking for a regex that will capture the 3 groupings in the example above, and here is the solution from the previous SO post:

Regex regex = new Regex(@"\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)"); 
return regex.Matches(str);

I need a slight tweak to the regex pattern above so that it only matches groupings that begin with ([ and not merely with (. The end remains the same: )

The match requirement is simple:

  1. Opening paren + square bracket (([) is either the first character in the file, or it follows a newline.
  2. Closing paren is either the last character in the file, or it is followed by a newline.

I want the regex pattern to ignore all paren-groupings that don't obey numbers 1 and 2 above. By "ignore" I mean that they shouldn't be seen as a match - but they should be returned as part of the outer grouping match.

So, for my objective to be met, when my C# regex runs against the example above, I should get back a regex MatchCollection with exactly 3 matches, just as shown above.

How is it done?


Solution

  • You may add a positive lookahead right after the initial \( that requires the next character to be [. Also, since the leading ([ can only appear at the start of a line and the closing ) can only appear at the end of a line, it makes sense to add the ^ and \r?$ anchors. The \r? is needed for files with CRLF line endings, because in multiline mode $ only matches at a location before \n (or at the very end of the string), not before \r.

    So, your regex may be adjusted to

    var results = Regex.Matches(text, 
                      @"^\((?=\[)(?>\((?<c>)|[^()]+|\)(?<-c>))*\)\r?$", 
                      RegexOptions.Multiline)
                  .Cast<Match>()
                  .Select(x => x.Value)
                  .ToList();
    

    See the .NET regex demo.
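To make the effect of \r? concrete, here is a minimal sketch that runs the adjusted pattern against a shortened, made-up sample with CRLF line endings. The sample text and the variable names (sample, withCr, withoutCr, blocks) are assumptions for illustration only, and the code uses top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical, shortened sample in the shape of the question's data,
// using Windows (CRLF) line endings.
string sample =
    "([txtData] of COMPOSITE\r\n" +
    "(dirty FALSE)\r\n" +
    "(composite [txtModel])\r\n" +
    ")\r\n" +
    "\r\n" +
    "([txtUI] of COMPOSITE\r\n" +
    "(view [star2])\r\n" +
    ")\r\n";

// The adjusted pattern from the answer, and the same pattern without \r?.
string withCr = @"^\((?=\[)(?>\((?<c>)|[^()]+|\)(?<-c>))*\)\r?$";
string withoutCr = @"^\((?=\[)(?>\((?<c>)|[^()]+|\)(?<-c>))*\)$";

var blocks = Regex.Matches(sample, withCr, RegexOptions.Multiline)
    .Cast<Match>()
    // The optional \r is consumed by the pattern, so it is part of the
    // match value; trim it if it is not wanted.
    .Select(m => m.Value.TrimEnd('\r'))
    .ToList();

int countWithoutCr =
    Regex.Matches(sample, withoutCr, RegexOptions.Multiline).Count;

Console.WriteLine(blocks.Count);    // 2
Console.WriteLine(countWithoutCr);  // 0
```

With CRLF input, every closing ) is followed by \r, so the variant without \r? finds nothing, while the adjusted pattern returns one match per outer group.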

    Details

    • ^ - start of a line
    • \( - a ( char
    • (?=\[) - a [ should immediately follow the current position
    • (?>\((?<c>)|[^()]+|\)(?<-c>))* - 0 or more repetitions of
      • \((?<c>)| - a ( char, and an empty value is pushed onto the group "c" capture stack, or
      • [^()]+| - 1 or more chars other than ( and ), or
      • \)(?<-c>) - a ) char, and an empty value is popped from the group "c" capture stack
    • \) - a ) char
    • \r?$ - an optional CR and end of a line.
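The push/pop behaviour described above can be checked in isolation. The following is a rough sketch, not part of the original answer: it strips the pattern down to single-line input and compares it with a variant that keeps the (?(c)(?!)) check from the question's original pattern, which the adjusted pattern omits. The names lax and strict and the three test strings are made up for illustration; the code assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Same push/pop alternation as in the answer, reduced to one line of input.
string lax = @"^\((?>\((?<c>)|[^()]+|\)(?<-c>))*\)$";

// Adds the (?(c)(?!)) conditional from the question's original pattern:
// if anything is still on the "c" stack, (?!) forces a failure.
string strict = @"^\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)$";

// Balanced input: both variants match.
bool balancedLax = Regex.IsMatch("(a (b))", lax);        // true
bool balancedStrict = Regex.IsMatch("(a (b))", strict);  // true

// One inner ( is never closed: only the conditional rejects it.
bool unclosedLax = Regex.IsMatch("(a (b)", lax);         // true
bool unclosedStrict = Regex.IsMatch("(a (b)", strict);   // false

// One ) too many: the pop fails on an empty stack, so both reject it.
bool extraLax = Regex.IsMatch("(a))", lax);              // false
bool extraStrict = Regex.IsMatch("(a))", strict);        // false

Console.WriteLine($"{balancedLax} {balancedStrict}");
Console.WriteLine($"{unclosedLax} {unclosedStrict}");
Console.WriteLine($"{extraLax} {extraStrict}");
```

If the input files might contain unclosed inner parentheses, placing (?(c)(?!)) before the final \) in the adjusted pattern, as in the question's original regex, is one way to reject them; whether that matters depends on the data.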