Consider this string (notice the horizontal scroll - the string is long):
$content = 'Xxxxxx xx xxxx xxxxxx/xxxx xxxxxxx xx xxxxx xx xxx XXXXXXX/XXXXX XXXX XXXXXXX XXXX XXXXXX XXXXX XXXXXX XXXXXX XXXXXX XXXXX XXXXXX';
I have my own mb_trim()
function to support unicode strings, but I found it's performing really bad for this string in specific.
After debugging, I realized that it's just the "end-of-string" bit that doesn't perform, while "beginning-of-string" is fine.
So, just doing this (minimal code):
$trim = preg_replace('/\s+$/u', '', $content);
This takes 2s ~ 3s.
But even without the u
modifier, it still takes ~1.60s.
If I replace the spaces in the middle with some letter, the preg_replace
will take 0s.
Is there a way to fix this performance issue?
It's funny that if I run this:
$trim = preg_replace('/\s{2,}/u', ' ', $content);
$trim = preg_replace('/\s+$/u', '', $trim);
This will run fast.
But I don't understand why are the spaces in the middle of the string a problem for an "end-of-string" regex. I'd think it would be optimized in a way that it would only look at the end of the string and not in the middle.
--
UPDATE - This seems to take the 2s on the server running AlmaLinux (even though it has a very good CPU and RAM) and on a Docker container running CentOS 7 on a Windows. But if I run the script on the Windows itself, it runs instantly. It also runs fast on 3v4l.
I tried on another Linux host running PHP 7.4, and it took 5.4s.
I wonder what could be causing the hang on the Linux systems above?
Late comes my answer! The bad performance is a result from a lot of backtracking if the input contains many whitespaces which are not at the end. Looks like even Stack Overflow faced some similar problem. I stumbled upon this old blogpost recently: The RegEx that killed StackOverflow
The most obvious and first idea to use a possessive quantifier for reducing backtracking:
\s++$
Using this demo-input at regex101 it shows almost hundert times fewer steps than without.
With a little trick to let consume everything until the last \S
non-whitespace even better:
^(?s)(?>.*\S\K)?\s+$
Getting it down to just 20 steps from initially more than 200k steps with the demo input.
s
flag (single line) makes the dot also match newlines(?>
atomic group )
for failing fast if no whitespace at the end\K
resets beginning of the reported match after the last \S
Here a little PHP benchmark at tio.run for comparing the performance of all three variants.
An alternative for other regex flavors can be to replace ^([\S\s]*\S)?\s*$
with $1
(or \1
).