php c performance micro-optimization microbenchmark

Why is strtolower slightly slower than strtoupper?

I did an experiment out of curiosity. I wanted to see if there was any micro difference at all between strtolower() and strtoupper(). I expected strtolower() to be faster on mostly lowercase strings and vice versa. What I found is that strtolower() was slower in all cases (although the difference is completely insignificant until you're doing it millions of times). This was my test.

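$string = 'hello world';
// Pass true so microtime() returns a float; the default "msec sec" string cannot be subtracted reliably.
$start_time = microtime(true);
for ($i = 0; $i < 10000000; $i++) {
    strtolower($string);
}
$timed = microtime(true) - $start_time;
echo 'strtolower ' . $string . ' - ' . $timed . '<br>';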

I repeated this for strtolower() and strtoupper() with hello world, HELLO WORLD, and Hello World. Here is the full gist. I've run the code several times and keep getting roughly the same results. One run of the test is shown below. (It was done with the original source, which used $i < $max = 1000000 as in the gist, so there is potentially extra overhead in the loop; see comments.)

strtolower hello world - 0.043829
strtoupper hello world - 0.04062
strtolower HELLO WORLD - 0.042691
strtoupper HELLO WORLD - 0.015475
strtolower Hello World - 0.033626
strtoupper Hello World - 0.017022
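For anyone who wants to reproduce the comparison without copying the loop six times, the repetition described above could be scripted roughly as follows. This is only a sketch and not the code from the gist: the helper name time_calls is made up for illustration, and calling the functions through a callable adds some overhead of its own (the same for both functions).

```php
<?php
// Hypothetical helper (not from the gist): calls $fn on $input $iterations
// times and returns the elapsed wall-clock time in seconds.
function time_calls(callable $fn, string $input, int $iterations): float
{
    $start = microtime(true); // true => float seconds
    for ($i = 0; $i < $iterations; $i++) {
        $fn($input);
    }
    return microtime(true) - $start;
}

$iterations = 1000000; // assumed count, matching the $max mentioned for the gist
$results = [];

// Every combination of input string and function, as in the question.
foreach (['hello world', 'HELLO WORLD', 'Hello World'] as $string) {
    foreach (['strtolower', 'strtoupper'] as $fn) {
        $results["$fn $string"] = time_calls($fn, $string, $iterations);
    }
}

foreach ($results as $label => $seconds) {
    printf("%s - %.6f\n", $label, $seconds);
}
```

Timings from a script like this vary between runs and machines, so it is worth running it several times before reading anything into small differences.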

I believe the C code in the php-src GitHub repository that controls this is here for strtolower() and here for strtoupper().

To be clear, this isn't going to prevent me from ever using strtolower(). I am only trying to understand what is going on here.

Why is strtolower() slower than strtoupper()?

Solution

It mostly depends on which character encoding you are currently using, but the main cause of the speed difference is the encoded size of special characters, which can differ between the lowercase and the uppercase form of the same letter.

Taken from babelstone.co.uk:

For example, lowercase j with caron (ǰ) is represented as a single encoded character (U+01F0 LATIN SMALL LETTER J WITH CARON), but the corresponding uppercase character (J̌) is represented in Unicode as a sequence of two encoded characters (U+004A LATIN CAPITAL LETTER J + U+030C COMBINING CARON).
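The size difference in that quote can be checked directly. The sketch below only inspects the two character sequences themselves, written as explicit code points; it does not call strtolower() or strtoupper(), so it says nothing about how those functions are implemented. It assumes PHP 7 or later for the \u{...} escape syntax.

```php
<?php
// The two forms from the quote, written as explicit code points.
$lower = "\u{01F0}";   // LATIN SMALL LETTER J WITH CARON (one code point)
$upper = "J\u{030C}";  // LATIN CAPITAL LETTER J + COMBINING CARON (two code points)

// With the /u modifier, "." matches one whole UTF-8 code point,
// so the number of matches is the number of code points.
$lowerPoints = preg_match_all('/./su', $lower);
$upperPoints = preg_match_all('/./su', $upper);

// strlen() counts bytes, not characters.
printf("lowercase: %d code point(s), %d byte(s)\n", $lowerPoints, strlen($lower));
printf("uppercase: %d code point(s), %d byte(s)\n", $upperPoints, strlen($upper));
// prints:
// lowercase: 1 code point(s), 2 byte(s)
// uppercase: 2 code point(s), 3 byte(s)
```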

More data to sift through in the index of Unicode characters will inevitably take a little longer.

Keep in mind that strtolower uses your current locale, so if your server is using a character encoding that does not support strtolower of special characters (such as 'Ê'), it will simply return the special character unchanged. The character mapping for UTF-8 is set up, however, which can be confirmed by running mb_strtolower.
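A quick way to see this on your own server is to run both functions on the same character and compare the bytes. This is a minimal sketch that assumes the mbstring extension is loaded. What strtolower() returns here depends on the PHP version and locale, so the script prints that result instead of assuming it.

```php
<?php
$input = "\u{00CA}"; // 'Ê', stored in UTF-8 as the two bytes C3 8A

// Byte-oriented conversion: the outcome depends on PHP version and locale,
// so it is printed for inspection rather than asserted.
$byteResult = strtolower($input);

// Encoding-aware conversion with the encoding stated explicitly.
$mbResult = mb_strtolower($input, 'UTF-8');

printf("strtolower:    %s (hex %s)\n", $byteResult, bin2hex($byteResult));
printf("mb_strtolower: %s (hex %s)\n", $mbResult, bin2hex($mbResult));
```

If the strtolower line still shows hex c38a, the character was returned unchanged; the mb_strtolower line should show c3aa, which is 'ê'.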

There is also the possibility of comparing the number of characters that fall into the uppercase category with the number you will find in the lowercase category, but once again, that depends on your character encoding.

In short, strtolower has a bigger database of characters to compare each individual string character to when it checks whether or not the character is uppercase.