Why do PHP's mb_convert_case() and mb_strtoupper() convert µ (U+00B5 MICRO SIGN) to "Μ"?


I'm trying to write my own mb_ucwords() function as a quick wrapper around mb_convert_case() so that it works with multibyte strings, since the built-in ucwords() function does not.

I have run into an issue where a string that starts with the µ character (U+00B5 MICRO SIGN) comes back as "Μ" (U+039C GREEK CAPITAL LETTER MU) instead of being left unchanged, as I assumed it would be.

I wrote a quick test script to verify some information:

        function testUtf8($letter) {
            echo "CHAR: " . $letter . "\n";
            echo "Detected Encoding: " . mb_detect_encoding($letter) . "\n";
            echo "IS VALID UTF-8? " . (mb_check_encoding($letter, 'UTF-8') ? 'YES' : 'NO') . "\n";
            $lower = mb_strtolower($letter);
            $upper = mb_strtoupper($letter);
            $conv = mb_convert_case($letter, MB_CASE_TITLE, 'UTF-8');
            echo "mb_strtolower(): " . $lower . "(" . mb_ord($lower) . ")\n";
            echo "mb_strtoupper(): " . $upper . "(" . mb_ord($upper) . ")\n";
            echo "mb_convert_case(): " . $conv . "(" . mb_ord($conv) . ")\n";
            echo "\n";
            echo "Matches RegEx /\p{L}/u: " . (preg_match('/\p{L}/u', $letter) ? 'YES' : 'NO') . "\n";
            echo "Matches RegEx /\p{N}/u: " . (preg_match('/\p{N}/u', $letter) ? 'YES' : 'NO') . "\n";
            echo "Matches RegEx /\p{Xan}/u: " . (preg_match('/\p{Xan}/u', $letter) ? 'YES' : 'NO') . "\n";
        }

        testUtf8('µ');

And the output I get is:

CHAR: µ
Detected Encoding: UTF-8
IS VALID UTF-8? YES
mb_strtolower(): µ(181)
mb_strtoupper(): Μ(924)
mb_convert_case(): Μ(924)

Matches RegEx /\p{L}/u: YES
Matches RegEx /\p{N}/u: NO
Matches RegEx /\p{Xan}/u: YES

Can someone explain to me why PHP thinks µ is a "letter" and why the MB uppercase version is "Μ"? I was going to work around this by testing the first character of each word and verifying that it is a valid Unicode "letter" before running the conversion, but as you can see that won't work here, since /\p{L}/u matches this character :(

Any idea how I can work around this?

Here is the rough draft of my function:

    /**
     * @param string $string The string to convert
     * @param string $encoding Default is UTF-8
     * @param string $delim_pattern Pattern used to break $string into words
     * @return string
     */
    public static function mb_ucwords(
        string $string,
        string $encoding = 'UTF-8',
        string $delim_pattern = '/([\/\-\s\v"\'\\\]+)/u'
    ): string {
        $words = preg_split($delim_pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
        $output = "";
        foreach($words as $word) {
            $output .= mb_convert_case($word, MB_CASE_TITLE, $encoding);
        }
        return $output;
    }

Currently testing this code against PHP 7.4.

EDIT:

Apparently this is a Greek letter as well as the symbol for micro, and Μ (U+039C) is the capital version of that Greek letter. I'm not sure how to handle this...


Solution

  • In Unicode 2, µ (U+00B5 MICRO SIGN) was changed to have a compatibility decomposition of μ (U+03BC GREEK SMALL LETTER MU). At the same time, its category was changed from symbol to letter, to match μ (U+03BC GREEK SMALL LETTER MU). This means that U+00B5 should not be used in new text; it exists only for compatibility with non-Unicode character sets. Under the compatibility normalization forms (NFKC and NFKD), U+00B5 is replaced by U+03BC, so the two are treated as the same character.

    In Unicode 3.0, it was given Μ (U+039C GREEK CAPITAL LETTER MU) as its uppercase mapping, which gives the result you see now.

    Unfortunately, since µ (U+00B5 MICRO SIGN) is basically deprecated, you're on your own if you use it. You could compare the first character of the string with µ (U+00B5 MICRO SIGN) before calling mb_convert_case. However, there's no guarantee that some system won't silently convert it to μ (U+03BC GREEK SMALL LETTER MU), for example if it normalizes the string. If you will never otherwise use μ (U+03BC GREEK SMALL LETTER MU), you could special-case that character as well.

    The fail-safe way to handle this without breaking support for Greek text would be to use some sort of markup language or rich text to indicate that the character is used as a symbol instead of a letter, and then parse that when performing the case conversion. But that would obviously be a larger undertaking.
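
As a rough illustration of the first-character check suggested above, here is a minimal sketch based on the question's mb_ucwords() draft. The names startsWithMicroSign() and ucwordsKeepingMicro() and the $treatGreekMuAsSymbol flag are made up for this example, the input is assumed to be valid UTF-8, and the delimiter pattern is copied from the question. It only looks at the first character of each word, so a µ later in a word (for example after a digit) is not protected.

```php
<?php
// Sketch only: the function names and the $treatGreekMuAsSymbol flag are
// hypothetical, and the input is assumed to be valid UTF-8.

const MICRO_SIGN = "\u{00B5}";      // µ (U+00B5 MICRO SIGN)
const GREEK_SMALL_MU = "\u{03BC}";  // μ (U+03BC GREEK SMALL LETTER MU)

/**
 * Returns true when the word starts with the micro sign, or with the
 * Greek small mu if the caller opted in to treating it as a symbol too.
 */
function startsWithMicroSign(string $word, bool $treatGreekMuAsSymbol = false): bool
{
    $first = mb_substr($word, 0, 1, 'UTF-8');
    if ($first === MICRO_SIGN) {
        return true;
    }
    return $treatGreekMuAsSymbol && $first === GREEK_SMALL_MU;
}

/**
 * Title-cases each word, but leaves words that start with µ untouched.
 */
function ucwordsKeepingMicro(
    string $string,
    bool $treatGreekMuAsSymbol = false,
    string $delimPattern = '/([\/\-\s\v"\'\\\]+)/u'
): string {
    // Keep the delimiters in the result so the string can be rebuilt as-is.
    $parts = preg_split($delimPattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $string; // the pattern failed; return the input unchanged
    }
    $output = '';
    foreach ($parts as $part) {
        $output .= startsWithMicroSign($part, $treatGreekMuAsSymbol)
            ? $part
            : mb_convert_case($part, MB_CASE_TITLE, 'UTF-8');
    }
    return $output;
}

// Prints "µm Scale And Hello World": the µ word is skipped, the rest is title-cased.
echo ucwordsKeepingMicro(MICRO_SIGN . "m scale and hello world"), "\n";
```

Passing true as the second argument also skips words that start with μ (U+03BC), which only makes sense if the text never contains real Greek words, as noted above.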