Search code examples
phpregexperformancestrpos

Why is mb_strpos() so considerably slower than strpos()?


I had criticized an answer that suggested preg_match over === when finding substring offsets in order to avoid type mismatch.

However, later on the answer's author has discovered that preg_match is actually significantly faster than multi-byte operating mb_strpos. Normal strpos is faster than both functions but of course, cannot deal with multibyte strings.

I understand that mb_strpos needs to do something more than strpos. However, if regex can do it almost as fast as strpos, what is it that mb_strpos does that takes so much time?

I have strong suspicion that it's an optimization error. Could, for example, PHP extensions be slower than its native functions?

mb_strpos($str, "颜色", 0 ,"GBK"): 15.988190889 (89%)
preg_match("/颜色/", $str): 1.022506952 (6%)
strpos($str, "dh"): 0.934401989 (5%)

Functions were run 106 times. The absolute time(s) accounts for the sum of time of 106 runs of a function, rather than average for one.

The test string is $str = "代码dhgd颜色代码";. The test can be seen here (scroll down to skip the testing class).

Note: According to one of the commentators (and common sense), preg_match also does not use multi-byte when comparing, being subject to same risk of errors as strpos.


Solution

  • To understand why the functions have a different runtime you need to understand what they actually do. Because summing them up as ‘they search for needle in haystack’ isn’t enough.

    strpos

    If you look at the implementation of strpos, it uses zend_memstr internally, which implements a pretty naive algorithm for searching for needle in haystack: Basically, it uses memchr to find the first byte of needle in haystack and then uses memcmp to check whether the whole needle begins at that position. If not, it repeats the search for the first byte of needle from the position of the previous match of the first byte.

    Knowing this, we can say that strpos does only search for a byte sequence in a byte sequence using a naive search algorithm.

    mb_strpos

    This function is the multi-byte counterpart to strpos. This makes searching a little more complex as you can’t just look at the bytes without knowing to which character they belong to.

    mb_strpos uses mbfl_strpos, which does a lot more in comparison to the simple algorithm of zend_memstr, it’s like 200 lines of complex code (mbfl_strpos) compared to 30 lines of slick code (zend_memstr).

    We can skip the part where both needle and haystack are converted to UTF-8 if necessary, and come to the major chunk of code.

    First we have two setup loops and then there is the loop that proceeds the pointer according to the given offset where you can see that they aware of actual characters and how they skip whole encoded UTF-8 characters: since UTF-8 is a variable-width character encoding where the first byte of each encoded character denotes the whole length of the encoded character. This information is stored in the u8_tbl array.

    Finally, the loop where the actual search happens. And here we have something interesting, because the test for needle at a certain position in haystack is tried in reverse. And if one byte did not match, the jump table jtbl is used to find the next possible position for needle in haystack. This is actually an implementation of the Boyer–Moore string search algorithm.

    So now we know that mb_strpos

    • converts the strings to UTF-8, if necessary
    • is aware of actual characters
    • uses the Boyer–Moore search algorithm

    preg_match

    As for preg_match, it uses the PCRE library. Its standard matching algorithm uses a nondeterministic finite automaton (NFA) to find a match conducting a depth-first search of the pattern tree. This is basically a naive search approach.