Search code examples
c#regexbacktracking

Regex catastrophic backtracking when not match


I'm testing my regex in https://regex101.com/ before doing any coding.

Regex:

\[(.*?)\]((?:.\s*)*?)\[\/\1\]

Example string:

[tag1]Test's Text Test Text Test Text Test Text.

Test Text Test Text Test "Text Test Text" Test Text Test Text.

Test Text? Test Text Test Text Test Text Test Text Test Text Test Text.

Test Text, Test Text Test Text Test Text Test Text Test Text Test Text Test Text.[/tag1]

[tag2]Test's Text Test Text Test Text Test Text.

Test Text Test Text Test "Text Test Text" Test Text Test Text.

Test Text? Test Text Test Text Test Text Test Text Test Text Test Text.

Test Text, Test Text Test Text Test Text Test Text Test Text Test Text Test Text.[/tag2]

....

....

I'm trying to capture 2 group in some long strings. the first one is the text inside the square brackets and the second one is the text inside the tag.

The regex and string above don't have any problem when the regex is match. If in match, the steps taken are only 1000+ each match. But if the opening and the closing tag are not in match, catastrophic backtracking happens and the match is finished in 126.000+ steps and stopped finding the other matching strings.

I know, to prevent backtracking problem is to avoid using nested constructs that have "+" or "*" but I can't seem to know any better way to do this.

Maybe someone can offer or suggest a better regex than mine?


Solution

  • Preface

    First thing, use appropriate testing environment. If you use a regex in .NET, do not test it at a regex tester that does not support .NET regex.

    Regex101.com does NOT support .NET regex!

    You regex pattern does not cause any catastrophic backtracking with the string you posted at RegexStorm.net.

    Root cause

    Ok, the regex pattern is really bad and inefficient. Why? The (?:.\s*)*? (enclosed in some larger pattern, as itself, standalone, it would not be a problem), matches any character followed with zero or more (thus, optional) whitespaces, all of that repeated 0 or more times, but as few as possible. So, both . and \s* can match the same string. When you wrap that in a group and add a quantifier, the total number of possible matching combinations that regex engine tries grows exponentially.

    Enhancing the pattern

    The enhacement is not so evident, but many will come with a solution like the one given by Federico: use lazy dot matching pattern. So, (?s)\[([^]]*)](.*?)\[/\1] (demo) looks a viable solution. It yields 7,843 iterations per second at RegexHero.net.

    Using an unroll the loop method, we can boost the regex performance n times depending on the input. Here, we may write the .*? subpattern as any character but a [ and any [ not followed with /\1] up to the \[/\1]. This can be written with negated character classes and a lookahead inside 1 quantified group (it won't even require any modifiers or flags):

    \[([^]]*)]([^[]*(?:\[(?!/\1])[^[]*)*)\[/\1]
    

    See this RegexStorm demo. This regex pattern yields 114,225 iterations per second. This is because there are no [ at all in between [tag1] and [/tag1], performance will get worse if a string contains a lot of [ or consists of just [ (which should not occur in real life).

    Testing

    Here is RegexHero testing:

    enter image description here

    Your original regex yielded just 5,094 ips on that site.