
What is the purpose of the MB_CASE_*_SIMPLE constants?


According to the manual, the following constants have been added in PHP 7.3:

  • MB_CASE_FOLD
  • MB_CASE_LOWER_SIMPLE
  • MB_CASE_UPPER_SIMPLE
  • MB_CASE_TITLE_SIMPLE
  • MB_CASE_FOLD_SIMPLE

I found an example of what MB_CASE_FOLD does:

echo mb_convert_case('ẞ', MB_CASE_FOLD, 'UTF-8'); // ss

However, I could not find any reference to what the MB_CASE_*_SIMPLE constants do.

At first glance, with simple Latin-1 characters, MB_CASE_LOWER_SIMPLE behaves just like MB_CASE_LOWER.

What do the MB_CASE_*_SIMPLE constants do differently from their MB_CASE_* counterparts?


Solution

  • The corresponding C implementation is at https://github.com/php/php-src/blob/master/ext/mbstring/php_unicode.c#L223

    The git commit message that introduced these constants explains:

    • Full case folding is implemented, but case-insensitive mb_* operations continue to use simple case folding. The reason is that full case folding of the haystack string may change the position at which a match occurred. This would have to be mapped back into the position in the original string.

    • mb_convert_case() exposes both the full and the simple case mapping / folding, where full is the default. The constants are:

      • MB_CASE_LOWER (used by mb_strtolower)
      • MB_CASE_UPPER (used by mb_strtoupper)
      • MB_CASE_TITLE
      • MB_CASE_FOLD
      • MB_CASE_LOWER_SIMPLE
      • MB_CASE_UPPER_SIMPLE
      • MB_CASE_TITLE_SIMPLE
      • MB_CASE_FOLD_SIMPLE (used by case-insensitive operations)
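
The offset problem mentioned in the commit message can be illustrated with a small sketch. It assumes PHP 7.3 or later with the mbstring extension loaded; the sample string and variable names are made up for illustration and do not come from php-src.

```php
<?php
// Sketch: full case folding can shift the position of a match,
// because "ß" folds to the two characters "ss".
$haystack = 'Maße und Gewichte';
$needle   = 'UND';

// Fold the haystack both ways; fold the needle once (plain ASCII, so both modes agree).
$fullFolded   = mb_convert_case($haystack, MB_CASE_FOLD, 'UTF-8');
$simpleFolded = mb_convert_case($haystack, MB_CASE_FOLD_SIMPLE, 'UTF-8');
$foldedNeedle = mb_convert_case($needle, MB_CASE_FOLD, 'UTF-8');

// Offsets are counted in characters, not bytes.
$posInFull   = mb_strpos($fullFolded, $foldedNeedle, 0, 'UTF-8');
$posInSimple = mb_strpos($simpleFolded, $foldedNeedle, 0, 'UTF-8');

// The built-in case-insensitive search reports an offset in the original string.
$posReported = mb_stripos($haystack, $needle, 0, 'UTF-8');

printf("full folding:   %d\n", $posInFull);
printf("simple folding: %d\n", $posInSimple);
printf("mb_stripos:     %d\n", $posReported);
```

In the fully folded haystack the match sits one character further right than in the original string, while the simply folded haystack keeps the original offset. That is the mapping-back problem the commit message gives as the reason for keeping simple folding in case-insensitive operations.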

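    So the constants with the _SIMPLE suffix apply Unicode's simple case mapping and folding, where every code point maps to exactly one code point, so the length of the string in characters never changes. The constants without the suffix apply full case mapping and folding, where a single code point may map to several (for example, ß uppercases to SS).

    In other words, the question comes down to the difference between full and simple case folding.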
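
A minimal comparison of the two families, assuming PHP 7.3 or later with mbstring enabled. The word "Straße" is only an illustrative input chosen because it contains ß; the array and variable names are hypothetical.

```php
<?php
// Sketch: compare full and simple conversions on a word containing "ß".
$word = 'Straße';

$modes = [
    'MB_CASE_UPPER'        => MB_CASE_UPPER,
    'MB_CASE_UPPER_SIMPLE' => MB_CASE_UPPER_SIMPLE,
    'MB_CASE_FOLD'         => MB_CASE_FOLD,
    'MB_CASE_FOLD_SIMPLE'  => MB_CASE_FOLD_SIMPLE,
];

$results = [];
foreach ($modes as $name => $mode) {
    $converted = mb_convert_case($word, $mode, 'UTF-8');
    $results[$name] = $converted;
    // Print the converted string together with its length in characters.
    printf("%-22s %s (%d characters)\n", $name, $converted, mb_strlen($converted, 'UTF-8'));
}
```

The full modes expand ß into two letters, so their results are seven characters long, while the _SIMPLE modes leave ß in place and keep the original six characters. For input without such special characters, which covers most everyday Latin-1 text, both families give the same result, which matches the observation in the question.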