Tags: regex, algorithm, performance-testing, analysis

Worst input for a given regular expression


I want to automate testing of regular expressions in my code base.

I'd like to protect against (a+)+ evil regexps and their kin.

For that I'm looking for an approach or existing library that generates "worst case" inputs for a given regular expression and engine (both NFA and DFA-based engines are in scope).

Granted, regular expressions are a powerful language, and it's clearly computationally hard to find the worst input for an arbitrary regular expression, especially if backreferences are used; it may even be undecidable.

For my use-case, I'm fine with finding inputs that are terrible (as opposed to worst possible), yet quite short.


Solution

  • The worst input for a regular expression will vary from engine to engine. The same regex and string may take no time at all on one engine, but never finish on another.

    Differences between engines

    Engine Type

    For certain engines, the "evilest" regex is still benign, running in linear time (or in O(n*m) time, where n is the length of the string and m is the length of the regex, when both may vary). Of course, the reason for this is the implementation: these engines don't backtrack; instead they use a finite state machine (FSM).

    Note that some backtracking implementations use an FSM, but only as an intermediate step. Don't let this confuse you; they're not FSM engines.

    Most of the old regex engines (like sed) use FSM matching. There are a few new engines that use this implementation, such as Go. PCRE even has DFA functions (search for "DFA" here) that use this type of matching.

    Another answer of mine also addresses the potential speed difference between the two implementations.

    It would be wise to consider using an FSM implementation if you are really worried about malicious input affecting the speed of your regex. Unfortunately, an FSM is not as powerful as the backtracking implementation; it lacks support for some features, such as backreferences.
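
    To see the difference concretely, here is a small sketch of mine (not from any of the linked docs) that pits java.util.regex, a backtracking engine, against RE2/J, an FSM engine; it assumes the com.google.re2j library is on the classpath and that its API mirrors java.util.regex, which is its stated design goal:

    import java.util.regex.Pattern;

    public class EngineComparison {
        public static void main(String[] args) {
            // 25 'a's followed by a 'b': (a+)+$ can never match, so a
            // backtracking engine tries exponentially many ways to split
            // the 'a's before giving up.
            String input = "a".repeat(25) + "b"; // String.repeat needs Java 11+

            long start = System.nanoTime();
            boolean jdk = Pattern.compile("(a+)+$").matcher(input).matches();
            System.out.printf("java.util.regex: %b in %d ms%n",
                    jdk, (System.nanoTime() - start) / 1_000_000);

            start = System.nanoTime();
            boolean re2 = com.google.re2j.Pattern.compile("(a+)+$")
                    .matcher(input).matches();
            System.out.printf("RE2/J (FSM):     %b in %d ms%n",
                    re2, (System.nanoTime() - start) / 1_000_000);
        }
    }

    Expect the java.util.regex call to take orders of magnitude longer: its time doubles with each extra a, while the FSM engine's time grows only linearly with the input.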

    Optimizations

    Evil is actually a bit subjective. Something evil to one regex engine may not be evil to a different engine. An evil plot can be thwarted if the engine is optimized. Optimizations are particularly important to backtracking engines, given their potential exponential run time.

    Short-circuiting

    Under certain conditions, the engine may be able to quickly determine a match is impossible. While running the regex a*b?a*x against the string aaaaaaaaaaaaaaaaaaaaaaaaaa, Regex101 says:

    Your match failed outright. What this means is the engine, due to its internal optimizations, understood that your pattern would never match at any position, and thus did not even attempt to.

    Keep in mind that Wikipedia says the regex is evil, especially when paired with that string.

    Of course, the engine is smart enough not to need to backtrack to determine that the match couldn't work. It saw something pretty obvious: the regex needs an x in order to match, but no x was present in the string.
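
    If you want the same safety net in front of an engine you don't control, you can hand-roll the check. A toy illustration (my own sketch, not any engine's actual internals): a*b?a*x can only ever match strings containing x, so an x-free input can be rejected in O(n) without running the matcher at all.

    public class LiteralPrecheck {
        static boolean canPossiblyMatch(String input) {
            return input.indexOf('x') >= 0; // scan for the mandatory literal
        }

        public static void main(String[] args) {
            String input = "aaaaaaaaaaaaaaaaaaaaaaaaaa";
            if (canPossiblyMatch(input)) {
                // Only now pay for the (potentially expensive) regex.
                java.util.regex.Pattern.compile("a*b?a*x").matcher(input).find();
            } else {
                System.out.println("fail fast: no 'x' in the input");
            }
        }
    }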

    Modifiers

    I mention this because you might not expect modifiers to be a factor in regex performance. But they are.

    Even PCRE, one of the more optimized implementations, may take considerably more steps with both the u and i modifiers enabled. See my question here for more information about this. In the end, I figured out that only certain characters trigger this behavior.
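
    For reference, these are the java.util.regex equivalents of those two modifiers; the step counts above were measured on PCRE, and whether other engines pay a similar cost for Unicode case folding is implementation-specific:

    import java.util.regex.Pattern;

    public class ModifierDemo {
        public static void main(String[] args) {
            // ASCII-only case folding: 'É' does not fold to 'é' here.
            Pattern ascii = Pattern.compile("café", Pattern.CASE_INSENSITIVE);
            // Unicode case folding, roughly PCRE's /iu.
            Pattern unicode = Pattern.compile("café",
                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            System.out.println(ascii.matcher("CAFÉ").matches());   // false
            System.out.println(unicode.matcher("CAFÉ").matches()); // true
        }
    }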

    Analyzing Strings

    String length

    In general, a long string will be slower than a short string. In fact, if you find a string of length x that causes catastrophic backtracking, you can make it backtrack far more (exponentially more, on a backtracking engine) by increasing the length of the string.
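
    You can watch this growth directly. A small sketch (the pattern and lengths are illustrative choices of mine): each extra a roughly doubles the work a backtracking engine does on (a+)+$ before giving up.

    import java.util.regex.Pattern;

    public class LengthScaling {
        public static void main(String[] args) {
            Pattern evil = Pattern.compile("(a+)+$");
            for (int n = 18; n <= 26; n += 2) {
                String input = "a".repeat(n) + "b"; // the trailing 'b' forces failure
                long start = System.nanoTime();
                evil.matcher(input).matches();
                System.out.printf("n=%d: %d ms%n", n,
                        (System.nanoTime() - start) / 1_000_000);
            }
        }
    }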

    Greedy vs. Lazy

    Compare the speeds of these regexes:

    • .*b on aaaaaaaa...aaaaab
    • .*?b on aaaaaaaa...aaaaab
    • .*b on abaaaaaaa...aaaaa
    • .*?b on abaaaaaaa...aaaaa

    Essentially, greedy matching is best when you think you will need to match a lot. Lazy matching is best when you need to match only a little.

    Note that if you change the regex to a*b or a*?b, then the engine may optimize things considerably.
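
    If you want to compare the four combinations above yourself, a rough harness like this works (timings vary by engine and warm-up, so take the numbers as relative, not absolute):

    import java.util.regex.Pattern;

    public class GreedyVsLazy {
        static long time(String regex, String input) {
            Pattern p = Pattern.compile(regex);
            long start = System.nanoTime();
            for (int i = 0; i < 1_000; i++) {
                p.matcher(input).find();
            }
            return (System.nanoTime() - start) / 1_000_000;
        }

        public static void main(String[] args) {
            String bAtEnd   = "a".repeat(10_000) + "b";  // lots to match: greedy's case
            String bAtStart = "ab" + "a".repeat(10_000); // little to match: lazy's case
            System.out.println(".*b  on late b:  " + time(".*b",  bAtEnd)   + " ms");
            System.out.println(".*?b on late b:  " + time(".*?b", bAtEnd)   + " ms");
            System.out.println(".*b  on early b: " + time(".*b",  bAtStart) + " ms");
            System.out.println(".*?b on early b: " + time(".*?b", bAtStart) + " ms");
        }
    }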

    Brute force testing

    There are several frameworks that are specifically designed to try to find vulnerabilities in your regexes. It may be worthwhile to try one out.

    There's really only one thing I'd suggest if you want to try making your own algorithm: it's not practical to try all characters in the dictionary, especially if you want to test long strings.

    Instead, look at your regex to determine which characters you should test. If your regex is (a+)+, only two things really matter to the match: a and not a. When generating strings to brute-force with, you can pretend the alphabet has just two characters, a and b (standing in for "not a"), as in the sketch below.
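
    Here's a sketch of that idea: enumerate every string over the two-character alphabet up to some length and flag anything that blows a time budget. The pattern, budget, and depth are illustrative choices of mine; a real tool would be smarter about search order and pruning.

    import java.util.regex.Pattern;

    public class RegexFuzzer {
        static final Pattern TARGET = Pattern.compile("(a+)+$");
        static final char[] ALPHABET = {'a', 'b'}; // 'b' stands in for "not a"
        static final long BUDGET_MS = 20;

        public static void main(String[] args) {
            search(new StringBuilder(), 22);
        }

        static void search(StringBuilder prefix, int maxLen) {
            long start = System.nanoTime();
            TARGET.matcher(prefix).matches();
            long elapsed = (System.nanoTime() - start) / 1_000_000;
            if (elapsed > BUDGET_MS) {
                System.out.println("slow input (" + elapsed + " ms): " + prefix);
                return; // assume longer extensions are at least as bad
            }
            if (prefix.length() == maxLen) return;
            for (char c : ALPHABET) {
                prefix.append(c);
                search(prefix, maxLen);
                prefix.deleteCharAt(prefix.length() - 1);
            }
        }
    }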

    Setting timeouts

    It would be fantastic to be able to ensure your regex finishes before the heat death of the universe, right? Some regex engines do have a way to set a timeout.

    .NET:

    AppDomain domain = AppDomain.CurrentDomain;
    // Set a default regex match timeout of 2 seconds for this AppDomain.
    domain.SetData("REGEX_DEFAULT_MATCH_TIMEOUT", TimeSpan.FromSeconds(2));
    

    Java

    // Assumes pattern and input are defined elsewhere, and that
    // InterruptibleCharSequence is a CharSequence wrapper whose charAt()
    // throws when the running thread is interrupted, aborting the match.
    Runnable runnable = new Runnable() {
        @Override
        public void run()
        {
            long startTime = System.currentTimeMillis();
            Matcher interruptibleMatcher = pattern.matcher(new InterruptibleCharSequence(input));
            interruptibleMatcher.find(); // runs for a long time!
            System.out.println("Regex took:" + (System.currentTimeMillis() - startTime) + "ms");
        }
    };
    Thread thread = new Thread(runnable);
    thread.start();
    Thread.sleep(500);  // let the match run for up to half a second...
    thread.interrupt(); // ...then abort it
    

    PHP

    bool set_time_limit ( int $seconds )

    Set the number of seconds a script is allowed to run. If this is reached, the script returns a fatal error. The default limit is 30 seconds or, if it exists, the max_execution_time value defined in the php.ini.

    When called, set_time_limit() restarts the timeout counter from zero. In other words, if the timeout is the default 30 seconds, and 25 seconds into script execution a call such as set_time_limit(20) is made, the script will run for a total of 45 seconds before timing out.

    Perl

    You might as well visit the link, since it's right on Stack Overflow.