
What is the balance matching regular expression for removing nested brackets composed of sets of ordered characters?


Following this question:

https://stackoverflow.com/a/24591578/1329812

I am trying to use balanced matching to replace all items within brackets, but in that example the brackets are "{{" and "}}", whereas my brackets are "<![CDATA[" and "]]>".

I am having trouble modifying the [^{}] part of the regular expression in the accepted answer to that question so that it uses my brackets instead. I have tried changing [^{}] to (?!(<!\[CDATA\|\]\]>)).

I have simplified the problem to use 12 as the open bracket and 34 as the close bracket. The following returns "STST" as expected.

using System.Text.RegularExpressions;

Regex.Replace(
    "12T1212E343434STST12RING34", // input
    @"12(?!(12|34))*(((?<Open>12)(?!(12|34))*)+((?<Close-Open>34)(?!(12|34))*)+)*(?(Open)(?!))34", // pattern
    "" // replacement
);

However, it does not work if I replace 12 with "<!\[CDATA\[" and 34 with "\]\]>".

Finally, I would like to operate on the following CDATA Sample String:

"<![CDATA[t<![CDATA[e]]>]]>stst<![CDATA[ring]]>"

should return

"stst"

Solution

  • Your current 12...34 regex does not work because the tempered greedy token in it is broken: (?!(12|34))* lacks the consuming part, the . that should follow the lookahead.

    Keep the parts of such a regex in mind: 1) the leading delimiter pattern, 2) the trailing delimiter pattern, 3) the part in between, which must match any text that is neither 1 nor 2, and 4) the conditional construct that checks whether the "technical" group capture stack is empty.

    So, the numeric regex can be fixed as

    12(?>(?!12|34).|(?<o>)12|(?<-o>)34)*(?(o)(?!))34
    

    and the CDATA one will look like

    <!\[CDATA\[(?>(?!<!\[CDATA\[|]]>).|(?<o>)<!\[CDATA\[|(?<-o>)]]>)*(?(o)(?!))]]>
    


    NOTE: If the input string can contain newline characters, use the RegexOptions.Singleline option or its inline modifier version, (?s), at the start of the pattern.
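
As a quick check of the two fixed patterns, here is a minimal sketch. It assumes C# 9+ top-level statements; the variable names and the multi-line sample string are illustrative and not part of the original question.

```csharp
using System;
using System.Text.RegularExpressions;

// Fixed numeric pattern: 12 opens, 34 closes.
string numericPattern = @"12(?>(?!12|34).|(?<o>)12|(?<-o>)34)*(?(o)(?!))34";
// Fixed CDATA pattern: <![CDATA[ opens, ]]> closes.
string cdataPattern = @"<!\[CDATA\[(?>(?!<!\[CDATA\[|]]>).|(?<o>)<!\[CDATA\[|(?<-o>)]]>)*(?(o)(?!))]]>";

string numericResult = Regex.Replace("12T1212E343434STST12RING34", numericPattern, "");
string cdataResult = Regex.Replace("<![CDATA[t<![CDATA[e]]>]]>stst<![CDATA[ring]]>", cdataPattern, "");

// Illustrative input with newlines inside the outer section:
// RegexOptions.Singleline lets . match \n as well.
string multiline = "<![CDATA[t\n<![CDATA[e]]>\n]]>stst<![CDATA[ring]]>";
string multilineResult = Regex.Replace(multiline, cdataPattern, "", RegexOptions.Singleline);

Console.WriteLine(numericResult);   // STST
Console.WriteLine(cdataResult);     // stst
Console.WriteLine(multilineResult); // stst
```

Verbatim strings (@"...") keep the backslashes in the pattern from being treated as C# escape sequences.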

    Pattern details:

    • 12 - the leading delimiter pattern
    • (?> - start of an atomic group that matches text that is neither the leading nor the trailing pattern, and keeps track of those delimiting substrings:
      • (?!12|34).| - match any char (including a newline if the RegexOptions.Singleline option is used) that is not the starting point of a 12 or 34 sequence, or
      • (?<o>)12| - match 12 and increment the "o" group capture stack, or
      • (?<-o>)34 - match 34 and decrement the "o" group capture stack
    • )* - repeat the patterns inside the atomic group zero or more times
    • (?(o)(?!)) - the conditional construct that checks whether the "o" group capture stack is empty. If it is not empty, the match fails at this point and backtracking is triggered, so a balanced number of leading/trailing delimiters is searched for.
    • 34 - the trailing delimiter pattern.
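
To see the stack check at work, the sketch below runs the numeric pattern on one balanced and one unbalanced input. Both sample inputs are made up for illustration. It assumes C# 9+ top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"12(?>(?!12|34).|(?<o>)12|(?<-o>)34)*(?(o)(?!))34";

// Balanced nesting: the outer 12...34 pair encloses the inner pair,
// so the whole string is one match.
Match balanced = Regex.Match("12a12b3434", pattern);

// Unbalanced: the first 12 has no closing 34 of its own, so no match
// can start there; only the inner, balanced pair is matched.
Match unbalanced = Regex.Match("12a12b34", pattern);

Console.WriteLine($"{balanced.Index}: {balanced.Value}");     // 0: 12a12b3434
Console.WriteLine($"{unbalanced.Index}: {unbalanced.Value}"); // 3: 12b34
```

In the second case the engine gives back the iterations it consumed after the first 12, finds no position where the stack is empty and a 34 follows, and then retries from later start positions.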

    Also, [ in <![CDATA[ must be escaped, since [ is a special char outside a character class, whereas ] in ]]> does not have to be escaped, since outside a character class ] is not special in a .NET regex.
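
If the delimiters vary, the escaping can be left to Regex.Escape. The sketch below uses a hypothetical helper, BuildBalancedPattern, which is not part of the answer above. It assumes C# 9+ top-level statements and non-empty delimiters, and plugs the escaped delimiters into the same pattern shape.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: builds the balanced-delimiter pattern for any
// pair of literal delimiters. Regex.Escape escapes [ but leaves ] as is.
static string BuildBalancedPattern(string open, string close)
{
    string openRx = Regex.Escape(open);
    string closeRx = Regex.Escape(close);
    return $"{openRx}(?>(?!{openRx}|{closeRx}).|(?<o>){openRx}|(?<-o>){closeRx})*(?(o)(?!)){closeRx}";
}

string builtPattern = BuildBalancedPattern("<![CDATA[", "]]>");
string builtResult = Regex.Replace(
    "<![CDATA[t<![CDATA[e]]>]]>stst<![CDATA[ring]]>", builtPattern, "");

Console.WriteLine(builtPattern);
Console.WriteLine(builtResult); // stst
```

For these two delimiters the generated string should coincide with the hand-written CDATA pattern, since Regex.Escape turns "<![CDATA[" into <!\[CDATA\[ and leaves "]]>" unchanged.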