I have following code:
void Main()
{
string template = @"
aaa
{begin iteration items}
bbbbbb
{begin iteration subitems}
ccccccc
{end iteration subitems}
ddddddddd
{begin iteration items}
hhhhhhhhhhhhhhhhh
{end iteration items}
iiiiiiiiiiiiiiiiiiiiiiiiiiii
{end iteration items}
eeeeeeeeeeeeeeee
{begin iteration items}
ffffff
{end iteration items}
gggggggggggg
";
string re = @"
\{\s*begin\s+iteration\s+items\s*}
(?<template>
(
(?<iteration>\{\s*begin\s+iteration\s+items\s*})
|(?<-iteration>\{\s*end\s+iteration\s+items\s*})
|((?!(\{\s*begin\s+iteration\s+items\s*})|(\{\s*end\s+iteration\s+items\s*})).*?)
)*(?(iteration)(?!))
)
\{\s*end\s+iteration\s+items\s*}
";
Regex r = new Regex(re, RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
var matches = r.Matches(template);
matches.Dump();
}
When template
is balanced then matches are returned and all is ok.
But when I change {end iteration items}
to {end1 iteration items}
after iiiiiiiiiiiiiii
line in template, then code stops to respond on matches.Dump()
line (Dump()
is an extension method to read/enumerate in LinQPad)
What is wrong? Is it possible to rewrite Regex so that it always respond?
EDIT
My goal is to capture all top level <template>
groups if syntax is valid, or capture nothing if not.
I tried non-backtracking groups as Lucas adviced, but there no captures now when syntax is valid.
You're experiencing catastrophic backtracking here.
In short: a pattern in the form of ((something)*)*
with nested quantifiers will trigger it, because the engine has to try all the possible combinations if a match can't be found right away.
You can use an atomic group to guard against it. The following should do the trick:
\{\s*begin\s+iteration\s+items\s*}
(?<template>
(?>
(?<iteration>\{\s*begin\s+iteration\s+items\s*})
|(?<-iteration>\{\s*end\s+iteration\s+items\s*})
|[^{]+
|\{
)*(?(iteration)(?!))
)
\{\s*end\s+iteration\s+items\s*}
Or use ((?>
...))
instead of (?>
...)
if you need capturing.
I simplified the expression - there's no need for the lookahead anymore when using an atomic group, since these cases would be handled by the iteration
groups. The last part of the alternative (\{
) is here to account for lone opening braces, which are not part of the begin/end sequence. Most of the text is consumed by [^{]+
inside the atomic group, so backtracking cannot happen.