Search code examples
c#regexperformanceregex-lookarounds.net-8.0

regular expressions: what's the best option for matching urls with several or expressions


suppose you're trying to match an urls which are compatible with the following 2 rules:

  1. ends with the word .../contactos(/)
  2. or has the word .../gestao/... in the middle of the url

Initially, I've started with the following pattern (pattern 1):

^(?:(?:/.+)+/contactos/?(?!.))|(?:(?:/.*)+/gestao(?:/.+))$

It works, but the negative lookahead which is required in this case seemed out of place, so I've replaced with the following pattern (lets call it pattern 2):

^(?:(?:/.+)+/contactos/?)$|^(?:(?:/.*)+/gestao(?:/.+))$

The main difference is that I'm using begin/end anchors to delimit each expression instead of having a single pair like in the first example (I'm not capturing groups in both cases, so that's really the only difference between both patterns). By "isolating" each "option", I can discard the negative lookahead used on pattern 1.

Since pattern 2 doesn't use lookaheads, I expected that it should be faster than pattern 1. However, that doesn't seem to be the case. I've built a small benchmark with .net 8 and it seems like pattern 1 is slightly faster than pattern for matching urls. Here's the code I've written (not really an experienced Benchmark user, so maybe I'm missing something):

[MemoryDiagnoser]
public class RegexPerformance {
    
    private const string _patternNoCapture = "^(?:(?:/.+)+/contactos/?(?!.))|(?:(?:/.*)+/gestao(?:/.+))$";
    private const string _patternNoCaptureWithStartEndAnchors = "^(?:(?:/.+)+/contactos/?)$|^(?:(?:/.*)+/gestao(?:/.+))$";

    private Regex _noCapture = default!;

    private Regex _noCaptureWithStart = default!;

    [GlobalSetup]
    public void Setup() {
        _noCapture = new (_patternNoCapture,
                               RegexOptions.Compiled | RegexOptions.IgnoreCase);
        
        _noCaptureWithStart = new (_patternNoCaptureWithStartEndAnchors,
                                        RegexOptions.Compiled | RegexOptions.IgnoreCase);
    }
    
    public IEnumerable<string> Urls => [
        "/test/contactos",
        "/test/contactos/",
        "/test/someplace/gestao/123/yyy",
        "/test/someplace/gestao/123"
    ];
    
    [Benchmark]
    [ArgumentsSource(nameof(Urls))]
    public bool MatchPatternNoCapture(string url) => _noCapture.IsMatch(url);
    
    [Benchmark]
    [ArgumentsSource(nameof(Urls))]
    public bool MatchPatternNoCaptureWithStartEndAnchors(string url) => _noCaptureWithStart.IsMatch(url);


}

And here are the results:


| Method                                   | url                  | Mean       | Error     | StdDev    | Median     | Allocated |
|----------------------------------------- |--------------------- |-----------:|----------:|----------:|-----------:|----------:|
| MatchPatternNoCapture                    | /test/contactos      |   244.5 ns |   8.25 ns |  23.53 ns |   238.6 ns |         - |
| MatchPatternNoCaptureWithStartEndAnchors | /test/contactos      |   224.3 ns |   3.14 ns |   2.62 ns |   223.5 ns |         - |
| MatchPatternNoCapture                    | /test/contactos/     |   256.6 ns |   5.67 ns |  15.80 ns |   259.9 ns |         - |
| MatchPatternNoCaptureWithStartEndAnchors | /test/contactos/     |   278.6 ns |  18.98 ns |  53.84 ns |   260.6 ns |         - |
| MatchPatternNoCapture                    | /test(...)o/123 [26] | 1,081.5 ns |  37.01 ns | 104.39 ns | 1,057.7 ns |         - |
| MatchPatternNoCaptureWithStartEndAnchors | /test(...)o/123 [26] | 1,367.4 ns | 167.26 ns | 493.18 ns | 1,121.6 ns |         - |
| MatchPatternNoCapture                    | /test(...)3/yyy [30] | 1,861.4 ns |  73.08 ns | 201.28 ns | 1,789.4 ns |         - |
| MatchPatternNoCaptureWithStartEndAnchors | /test(...)3/yyy [30] | 1,925.6 ns |  58.13 ns | 168.65 ns | 1,919.0 ns |         - |

Btw, I've run another simple test with a non-matching url (/test/contactosbad)and in this case it seems like pattern 2 behaves slghtly better than pattern 1:

| Method                                   | url                | Mean     | Error    | StdDev   | Allocated |
|----------------------------------------- |------------------- |---------:|---------:|---------:|----------:|
| MatchPatternNoCapture                    | /test/contactosbad | 668.4 ns | 13.44 ns | 14.38 ns |         - |
| MatchPatternNoCaptureWithStartEndAnchors | /test/contactosbad | 518.2 ns | 10.30 ns | 14.43 ns |         - |

I must admit that I'm a little puzzled with these results. I expected pattern 2 to behave much better than pattern 1, but that doesn't seem to be the case. Any ideas on why this happens?

Thanks.


Solution

  • Neither of your patterns are good: they are prone to catastrophic backtracking.

    Namely, the quantified groups with all-mighty dots: (?:/.+)+ and (?:/.*)+. These will take a long time to run on inputs which are just a bit longer. Here's an 115-characters input suggested by a ReDoS checker:

    /stao/stao/stao/stao/stao/stao/stao/stao/stao/stao/stao/stao/stao/stao/stao/stao/stao/stao/stao/stao
    

    This triggered a timeout on regex101.com after about 10 seconds (it might be a bit faster on your machine, since regex101 seems to be running the regex on client side). To put it in perspective, the time complexity of such a group is O(2^n) (exponential).

    Instead, keep it simple:

    ^
    (?:
      /.+/contactos/?$
    |
      /.+/gestao/.+
    )
    

    The first .+ runs till the end in 1 step, then backtracks until it finds the fixed /contactos (n steps); the second .+ does the same (n steps). The third do not need to backtrack, so it also takes only 1 step; you can replace it with a single . if you just want a boolean and not a full match. Collectively, the time complexity of this pattern is O(n) (linear).

    I haven't run a benchmark yet (consulting ChatGPT on how to write a proper C# program) but I expect this to outperform your patterns both in performance and readability. As an example, regex101 reports no match for an input of ten times longer than the input above in just milliseconds.

    Addendum

    The pattern above can also be written as:

    ^/.+/(?:contactos/?$|gestao/.+)
    

    The engine will only try out the branches whenever .+ backtracks to a /. This, therefore, will take less steps, but still O(n).