
How can I optimize this regex to get better performance?


I'm trying to optimize one of the regular expressions in my .NET app.

Regex: (?<!WordA\s(?:WordB\s)?)(WordB\s)?WordC

Logic:

  • Find matching WordC
  • Include WordB in the match (if present right before WordC)
  • Don't match anything if WordC (even if preceded by WordB) is preceded by WordA

Should Match:

  • WordC
  • WordB WordC

Should Not Match:

  • WordA WordC
  • WordA WordB WordC

The expression works, but as you can see, WordB appears twice in it, so I'm trying to remove one of the occurrences to get better performance.

Note: "Words" are in fact complex expressions.

Is there any way?


Solution

  • The (?<!WordA\s(?:WordB\s)?)(WordB\s)?WordC regex is effectively a combination of two patterns that share the same lookbehind: (?<!WordA\s(?:WordB\s)?)WordB\sWordC and (?<!WordA\s(?:WordB\s)?)WordC. The problem with "optimizing" it is that a negative lookbehind does not make the regex engine skip the whole phrase when WordB WordC is preceded by WordA; it only fails the match attempt at the position where the lookbehind was checked. If you just use (?<!WordA\s)(WordB\s)?WordC, the attempt that starts at WordB fails, but the engine then retries at WordC, where the text immediately to the left is WordB rather than WordA, so WordC alone matches. The lookbehind must therefore guard both WordB\sWordC and WordC, which is why the optional WordB has to be repeated inside the lookbehind.

    So, with a plain string regex, there is no other way.
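    To make the failure concrete, here is a minimal sketch that runs the pattern from the question and the shortened variant against the same input. The class name LookbehindDemo and the helper FirstMatch are hypothetical names used only for illustration, and WordA, WordB and WordC are treated as literal words rather than the complex expressions mentioned in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // The pattern from the question: WordB is repeated inside the lookbehind.
    public const string Full = @"(?<!WordA\s(?:WordB\s)?)(WordB\s)?WordC";

    // Shortened variant with WordB dropped from the lookbehind (illustration only).
    public const string Naive = @"(?<!WordA\s)(WordB\s)?WordC";

    // Hypothetical helper: returns the first match, or "NO MATCH" if there is none.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : "NO MATCH";
    }

    public static void Main()
    {
        const string input = "WordA WordB WordC";
        Console.WriteLine(FirstMatch(Full, input));   // NO MATCH
        Console.WriteLine(FirstMatch(Naive, input));  // WordC
    }
}
```

    With the shortened pattern, the attempt at WordB is rejected by the lookbehind, but the attempt at WordC succeeds because only WordB and a space sit directly to its left.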

    A workaround that involves a small code change can look like this:

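    // Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
    // Group 1 participates in the match only when WordA precedes the rest.
    var rx = @"(WordA\s)?(?:WordB\s)?WordC";
    var strings = new List<string> {"WordC", "WordB WordC", "WordA WordC", "WordA WordB WordC"};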
    foreach (var s in strings)
    {
        var m = Regex.Match(s, rx);
        Console.WriteLine("{0}: {1}", s, (m.Groups[1].Success ? "NO MATCH" : m.Value));
    }
    // => WordC: WordC
    // => WordB WordC: WordB WordC
    // => WordA WordC: NO MATCH
    // => WordA WordB WordC: NO MATCH
    


    In the (WordA\s)?(?:WordB\s)?WordC regex, (WordA\s)? captures WordA and the whitespace after it into Group 1 when they are present. If Group 1 participated in the match, we know the match must be discarded. If the Group 1 .Success value is false, the match is valid.
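    The same Group 1 check can be applied when scanning a longer text for all occurrences. The following is a sketch under the assumption that the words are literals and do not overlap one another; CaptureFilterDemo and ValidMatches are hypothetical names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureFilterDemo
{
    // Same pattern as above: Group 1 participates only when WordA precedes.
    private static readonly Regex Rx = new Regex(@"(WordA\s)?(?:WordB\s)?WordC");

    // Hypothetical helper: keeps every match in which Group 1 did not participate.
    public static List<string> ValidMatches(string input)
    {
        var results = new List<string>();
        foreach (Match m in Rx.Matches(input))
        {
            if (!m.Groups[1].Success)
            {
                results.Add(m.Value);
            }
        }
        return results;
    }

    public static void Main()
    {
        const string text = "WordC, WordA WordB WordC, WordB WordC, WordA WordC";
        Console.WriteLine(string.Join(" | ", ValidMatches(text)));
        // WordC | WordB WordC
    }
}
```

    Because the rejected matches still consume WordA, WordB and WordC, the engine does not retry inside them, which is the behavior the lookbehind alone could not provide. Whether this is faster than the lookbehind version depends on the real sub-expressions and would need to be measured.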