c# .net regex .net-5

Random RegexMatchTimeoutException exceptions


I sometimes get a RegexMatchTimeoutException when parsing a short (less than 100 characters) string. The parsing happens inside a function called from list.Select(..) on a collection of about 30 elements.

I suspect it may be due to a sub-optimal regex - here's the definition in C#:

internal override Regex Regex => new(
    @$"((.|\s)*\S(.|\s)*)(\[{this.Type}\])",             // Type = "input"
    RegexOptions.Multiline | RegexOptions.Compiled, 
    TimeSpan.FromMilliseconds(Constants.RegexTimeout));  // RegexTimeout = 100

It should capture Sample text in the following string:

Sample text
[input]

Full exception message:

System.Text.RegularExpressions.RegexMatchTimeoutException: 'The Regex engine has timed out while trying to match a pattern to an input string. This can occur for many reasons, including very large inputs or excessive backtracking caused by nested quantifiers, back-references and other factors.'

Line in which the exception occurs:

var label = this.Regex.Match(sectionContent).Groups[1].Value.Trim();

The exception is rather hard to reproduce - with the same input it can happen on the first run or on the 100th. But the bigger the collection of lines to run the Regex against, the bigger the chance of it occurring.


Solution

  • Your ((.|\s)*\S(.|\s)*)(\[input\]) regex matches

    • ((.|\s)*\S(.|\s)*) - Group 1:
      • (.|\s)* - zero or more occurrences of any char other than a newline (.) or (|) any whitespace char (\s)
      • \S - a single non-whitespace char
      • (.|\s)* - zero or more occurrences of any char other than a newline (.) or (|) any whitespace char (\s)
    • (\[input\]) - Group 2: [input].

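    Notice that the sub-patterns in Group 1 can all match the same characters. A space or a tab satisfies both . and \s, so (.|\s)* has more than one way to consume the same text, and the two (.|\s)* parts can split the text between them in many ways. Whenever the engine has to backtrack, it may try a very large number of these combinations, which is the "excessive backtracking caused by nested quantifiers" that the exception message mentions. \S is the only "anchoring" pattern here: it requires a single non-whitespace char. Since the patterns before and after \S are both meant to match any text, the most efficient logic is: match any amount of whitespace, then a non-whitespace char, and then any chars (as few as possible but as many as necessary) up to [input].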
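    To make the overlap concrete, here is a minimal sketch that checks which single characters each branch of (.|\s) accepts. OverlapDemo and MatchesWhole are hypothetical names used only for this illustration; the checks use the default options, so . does not match a newline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OverlapDemo
{
    // Hypothetical helper: true when the entire input matches the pattern.
    // \A and \z pin the match to the absolute start and end of the input.
    public static bool MatchesWhole(string pattern, string input) =>
        Regex.IsMatch(input, @"\A(?:" + pattern + @")\z");

    public static void Main()
    {
        // A space is accepted by both branches, so the alternation is ambiguous.
        Console.WriteLine(MatchesWhole(".", " "));     // True
        Console.WriteLine(MatchesWhole(@"\s", " "));   // True

        // A newline is accepted only by \s (no Singleline option here).
        Console.WriteLine(MatchesWhole(".", "\n"));    // False
        Console.WriteLine(MatchesWhole(@"\s", "\n"));  // True
    }
}
```

    Every space or tab in the input can therefore be taken by either branch, which is where the extra backtracking paths come from.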

    Here is the fix:

    internal override Regex Regex => new(
        @$"(?s)(\s*\S.*?)(\[{Regex.Escape(this.Type)}])",             // Type = "input"
        RegexOptions.Compiled, 
        TimeSpan.FromMilliseconds(Constants.RegexTimeout));  // RegexTimeout = 100
    

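    Note that this.Type is passed through Regex.Escape in case it contains any special regex characters. (?s) is the inline form of the RegexOptions.Singleline option, which makes . match newlines too; you can use either one. RegexOptions.Multiline is dropped because it only changes the behavior of ^ and $, and the pattern contains neither.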
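
    As a quick check of the fixed pattern, the following self-contained sketch runs it against the sample string from the question. LabelParser, ExtractLabel and RegexTimeoutMs are hypothetical names made up for this example (the original code uses an overridden Regex property and Constants.RegexTimeout), and RegexOptions.Singleline is used in place of the inline (?s).

```csharp
using System;
using System.Text.RegularExpressions;

public static class LabelParser
{
    // Assumption: same timeout value as in the question (100 ms).
    private const int RegexTimeoutMs = 100;

    // Hypothetical helper: returns the trimmed text that precedes "[type]",
    // or an empty string when the pattern does not match.
    public static string ExtractLabel(string sectionContent, string type)
    {
        var regex = new Regex(
            @"(\s*\S.*?)(\[" + Regex.Escape(type) + "])",
            RegexOptions.Singleline | RegexOptions.Compiled,
            TimeSpan.FromMilliseconds(RegexTimeoutMs));

        // Groups[1].Value is empty for a failed match, so Trim() is safe.
        return regex.Match(sectionContent).Groups[1].Value.Trim();
    }

    public static void Main()
    {
        var label = ExtractLabel("Sample text\n[input]", "input");
        Console.WriteLine(label);  // Sample text
    }
}
```

    Group 1 captures "Sample text" plus the trailing newline, which is why the Trim() call from the original code is still needed.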