Search code examples
phpregexperformancephp-8php-8.1

/\s+$/u performs really bad if string contains many spaces in the middle


Consider this string (notice the horizontal scroll - the string is long):

$content = 'Xxxxxx xx xxxx xxxxxx/xxxx xxxxxxx xx xxxxx xx xxx   XXXXXXX/XXXXX XXXX   XXXXXXX XXXX   XXXXXX                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               XXXXX XXXXXX   XXXXXX   XXXXXX XXXXX   XXXXXX';

I have my own mb_trim() function to support unicode strings, but I found it's performing really bad for this string in specific.

After debugging, I realized that it's just the "end-of-string" bit that doesn't perform, while "beginning-of-string" is fine.

So, just doing this (minimal code):

$trim = preg_replace('/\s+$/u', '', $content);

This takes 2s ~ 3s.
But even without the u modifier, it still takes ~1.60s.

If I replace the spaces in the middle with some letter, the preg_replace will take 0s.

Is there a way to fix this performance issue?

It's funny that if I run this:

$trim = preg_replace('/\s{2,}/u', ' ', $content);
$trim = preg_replace('/\s+$/u', '', $trim);

This will run fast.
But I don't understand why are the spaces in the middle of the string a problem for an "end-of-string" regex. I'd think it would be optimized in a way that it would only look at the end of the string and not in the middle.

--

UPDATE - This seems to take the 2s on the server running AlmaLinux (even though it has a very good CPU and RAM) and on a Docker container running CentOS 7 on a Windows. But if I run the script on the Windows itself, it runs instantly. It also runs fast on 3v4l.

I tried on another Linux host running PHP 7.4, and it took 5.4s.

I wonder what could be causing the hang on the Linux systems above?


Solution

  • Late comes my answer! The bad performance is a result from a lot of backtracking if the input contains many whitespaces which are not at the end. Looks like even Stack Overflow faced some similar problem. I stumbled upon this old blogpost recently: The RegEx that killed StackOverflow

    The most obvious and first idea to use a possessive quantifier for reducing backtracking:

    \s++$
    

    Using this demo-input at regex101 it shows almost hundert times fewer steps than without.


    With a little trick to let consume everything until the last \S non-whitespace even better:

    ^(?s)(?>.*\S\K)?\s+$
    

    Getting it down to just 20 steps from initially more than 200k steps with the demo input.

    • The s flag (single line) makes the dot also match newlines
    • (?> atomic group ) for failing fast if no whitespace at the end
    • \K resets beginning of the reported match after the last \S

    Here a little PHP benchmark at tio.run for comparing the performance of all three variants.
    An alternative for other regex flavors can be to replace ^([\S\s]*\S)?\s*$ with $1 (or \1).