
Is preg_match faster than strpos on large text?


I am currently updating a very old script, written for PHP 5.2.17, to run on PHP 8.1.2. It contains a lot of text-processing code blocks, and almost all of them use preg_match/preg_match_all. I have always understood strpos to be faster than preg_match for plain string matching, but I decided to check one more time.

The code was:

$c = file_get_contents('readme-redist-bins.txt');
$start = microtime(true);
for ($i=0; $i < 1000000; $i++) { 
    strpos($c, '[SOMEMACRO]');
}
$el = microtime(true) - $start;
exit($el);

and

$c = file_get_contents('readme-redist-bins.txt');
$start = microtime(true);
for ($i=0; $i < 1000000; $i++) { 
    preg_match_all("/\[([a-z0-9-]{0,100})".'[SOMEMACRO]'."/", $c, $pma);
}
$el = microtime(true) - $start;
exit($el);

I used the readme-redist-bins.txt file that ships with the PHP 8.1.2 distribution; it is about 30 KB.

Results (preg_match_all):

PHP_8.1.2: 1.2461s
PHP_5.2.17: 11.0701s

Results (strpos):

PHP_8.1.2: 9.97s
PHP_5.2.17: 0.65s

I double-checked this, trying both Windows and Linux PHP builds on two machines.

I then tried the same code with a small file (200 B).

Results (preg_match_all):

PHP_8.1.2: 0.0867s
PHP_5.2.17: 0.6097s

Results (strpos):

PHP_8.1.2: 0.0358s
PHP_5.2.17: 0.2484s

And now the timings are as expected.

So how can it be that preg_match is so much faster on large text? Any ideas?

PS: I also tried PHP_7.2.10, with the same result.


Solution

  • PCRE2 is really fast. It's so fast that there is usually barely any difference between it and plain string processing in PHP, and sometimes it's even faster. PCRE2 uses JIT compilation internally and contains a lot of optimizations. It's really good at what it does.

    On the other hand, strpos is poorly optimized. It does a simple byte comparison in C and doesn't use parallelization/vectorization. For short needles and short haystacks it uses memchr, but for longer values it uses the Sunday algorithm.

    For small datasets, the overhead of calling PCRE2 will probably outweigh its optimizations, but for larger strings, or for case-insensitive/Unicode matching, PCRE2 might offer better performance.
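
One caveat when reproducing the comparison: in the question's pattern the unescaped `[SOMEMACRO]` is a character class (one character out of that set), not the literal macro, so the two loops are not searching for the same thing. Below is a minimal sketch of a like-for-like benchmark, assuming PHP 7.4 or newer. The `bench()` helper, the synthetic haystack and the iteration count are all assumptions made for illustration; they are not taken from the question, and the timings will vary by machine and PHP build.

```php
<?php
// Hypothetical helper: call $fn $iterations times, return elapsed seconds.
function bench(callable $fn, int $iterations): float
{
    $start = hrtime(true);               // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / 1e9;
}

// Synthetic haystack (an assumption, standing in for the ~30 KB readme):
// filler text with the macro placed near the very end.
$needle   = '[SOMEMACRO]';
$haystack = str_repeat("lorem ipsum dolor sit amet\n", 1200) . $needle . "\n";

// preg_quote() escapes the brackets so the regex matches the literal macro.
$pattern = '/' . preg_quote($needle, '/') . '/';

// Sanity check: both approaches should report the same byte offset.
$posStr = strpos($haystack, $needle);
preg_match($pattern, $haystack, $m, PREG_OFFSET_CAPTURE);
$posRe = $m[0][1];

$tStr = bench(fn() => strpos($haystack, $needle), 20000);
$tRe  = bench(fn() => preg_match($pattern, $haystack), 20000);

printf("strpos:     %.4fs (offset %d)\n", $tStr, $posStr);
printf("preg_match: %.4fs (offset %d)\n", $tRe, $posRe);
```

Swapping `strpos` for `stripos` and adding the `i` modifier to the pattern gives the case-insensitive variant of the same comparison.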