
PHP string functions: which ones will work with UTF-8?


The PHP documentation lists string functions that operate at the byte level. That works for single-byte character set (SBCS) strings, but not for multibyte character set (MBCS) strings. Luckily, one popular encoding, UTF-8, is backward compatible with 7-bit US-ASCII.

Since PHP 5.6 the default encoding has been UTF-8, but its string functions have not changed. The well-known alternatives are iconv, Multibyte String and Intl. PCRE functions can also be MBCS compliant when PCRE is compiled with UTF-8 support.

When legacy SBCS code needs to be made MBCS (UTF-8) compliant, calls to the standard PHP byte string functions need to be rewritten to be MBCS safe. Although the most basic functions (like strpos()) have an mb_* variant (like mb_strpos()), most of PHP's string functions have no mb_* counterpart. For continued use they have to be rewritten.
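To illustrate what such a rewrite can look like, here is a minimal sketch for strrev(), which has no mb_* counterpart. The helper name utf8_strrev() is hypothetical. The sketch assumes PHP 7.4+ (for mb_str_split()), a loaded mbstring extension, valid UTF-8 input and a source file saved as UTF-8. It reverses code points, not grapheme clusters, so combining marks would still end up detached.

```php
<?php
// Hypothetical helper (not part of PHP): reverse a UTF-8 string by character.
// Assumes mbstring is loaded, PHP >= 7.4 and that $s is valid UTF-8.
function utf8_strrev(string $s): string
{
    // mb_str_split() splits into characters (code points), not bytes.
    $chars = mb_str_split($s, 1, 'UTF-8');
    return implode('', array_reverse($chars));
}

$word = "añb"; // "ñ" is two bytes in UTF-8: 0xC3 0xB1

// strrev() reverses bytes, so the two bytes of "ñ" end up swapped.
var_dump(strrev($word) === "bña");      // bool(false)
var_dump(utf8_strrev($word) === "bña"); // bool(true)
```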

In the first stage, one needs to determine which SBCS string functions will work despite their byte-oriented nature. Some have already been identified on SO. What I'm looking for now is a comprehensive list of functions that will work with UTF-8, or that will work when used with caution, for example with US-ASCII-only parameters. To clarify, the question is not about the byte string functions like chr() or crc32(); it's about getting a list of functions like:

  • Not safe: count_chars() counts bytes, ...
  • Caution: ltrim() will work as long as parameters are US-ASCII, ...
  • Safe: str_repeat() will work with MBCS strings, ...

Would anybody know such a list?
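The three categories can be illustrated with a short sketch. It assumes the source file is saved as UTF-8 and that mbstring is available for mb_check_encoding(); the sample strings are made up for the demonstration.

```php
<?php
$s = "é"; // U+00E9, two bytes in UTF-8: 0xC3 0xA9

// Not safe: count_chars() counts bytes, so one character shows up as two entries.
$counts = count_chars($s, 1); // mode 1: only byte values that occur
var_dump(count($counts));     // int(2)

// Caution: ltrim() works while the character list is US-ASCII only ...
var_dump(ltrim("  é") === "é"); // bool(true)

// ... but a non-ASCII character list is treated as a set of single bytes.
// Here the lead byte 0xC3 of "à" is stripped, leaving invalid UTF-8.
$broken = ltrim("àé", "é");
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// Safe: str_repeat() only concatenates whole copies of the byte sequence.
$tripled = str_repeat($s, 3);
var_dump($tripled === "ééé");                   // bool(true)
var_dump(mb_check_encoding($tripled, 'UTF-8')); // bool(true)
```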


Solution

  • Core PHP SBCS string functions

    Assuming the default encoding of PHP is set to UTF-8, these string functions will work:

    Unfortunately all other string functions do not work with UTF-8. Obstacles:

    • case handling or whitespace handling does not work with UTF-8
    • string lengths in parameters and return values are in bytes, not in characters
    • string processing causes data corruption
    • string function is completely ASCII oriented

    In some cases functions can work as expected when parameters are US-ASCII and lengths are byte lengths.
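A small sketch of the length and corruption obstacles, and of the case where byte functions still behave. It assumes a UTF-8 source file and mbstring for the validity check; the sample word is arbitrary.

```php
<?php
$s = "naïve"; // 5 characters, 6 bytes in UTF-8 ("ï" is 0xC3 0xAF)

// Lengths and offsets are reported in bytes, not characters.
var_dump(strlen($s));      // int(6)
var_dump(strpos($s, "v")); // int(4), although "v" is at character index 3

// Cutting at an arbitrary byte length can split a character.
$cut = substr($s, 0, 3);   // "na" plus the first byte of "ï"
var_dump(mb_check_encoding($cut, 'UTF-8')); // bool(false)

// With US-ASCII parameters and byte offsets fed back into byte functions,
// the result is still valid UTF-8.
var_dump(str_replace("v", "V", $s) === "naïVe"); // bool(true)
var_dump(substr($s, strpos($s, "v")) === "ve");  // bool(true)
```

The last two lines work because no UTF-8 multibyte sequence contains a byte below 0x80, so an ASCII needle can only match a real ASCII character.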

    Binary string functions are still useful:

    • bin2hex Convert binary data into hexadecimal representation
    • chr Return a specific character (=byte)
    • convert_uudecode Decode a uuencoded string
    • convert_uuencode Uuencode a string
    • crc32 Calculates the crc32 polynomial of a string
    • crypt One-way string hashing
    • hex2bin Decodes a hexadecimally encoded binary string
    • md5_file Calculates the md5 hash of a given file
    • md5 Calculate the md5 hash of a string
    • ord Return ASCII value of character (=byte)
    • sha1_file Calculate the sha1 hash of a file
    • sha1 Calculate the sha1 hash of a string
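These functions stay useful precisely because they deal in bytes. A short sketch, assuming a UTF-8 source file, that uses them to inspect the encoding of a single character:

```php
<?php
$euro = "€"; // U+20AC, encoded in UTF-8 as 0xE2 0x82 0xAC

// bin2hex() shows the raw bytes of the encoded character.
var_dump(bin2hex($euro)); // string(6) "e282ac"

// ord() only looks at the first byte.
var_dump(ord($euro));     // int(226), i.e. 0xE2

// chr() and hex2bin() build the same byte sequence back up.
var_dump(chr(0xE2) . chr(0x82) . chr(0xAC) === $euro); // bool(true)
var_dump(hex2bin("e282ac") === $euro);                 // bool(true)

// Hashes are computed over bytes, so they depend on the encoding used.
var_dump(md5($euro) === md5("\xE2\x82\xAC"));          // bool(true)
```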

    Configuration functions do not apply:

    Regular expression functions and encoding and transcoding functions are not considered.

    Extensions

    In quite a few cases, Multibyte String offers a UTF-8 variant:

    • mb_convert_case Perform case folding on a string
    • mb_parse_str Parse GET/POST/COOKIE data and set global variable
    • mb_split Split multibyte string using regular expression
    • mb_strcut Get part of string
    • mb_strimwidth Get truncated string with specified width
    • mb_stripos Finds position of first occurrence of a string within another, case insensitive
    • mb_stristr Finds first occurrence of a string within another, case insensitive
    • mb_strlen Get string length
    • mb_strpos Find position of first occurrence of string in a string
    • mb_strrchr Finds the last occurrence of a character in a string within another
    • mb_strrichr Finds the last occurrence of a character in a string within another, case insensitive
    • mb_strripos Finds position of last occurrence of a string within another, case insensitive
    • mb_strrpos Find position of last occurrence of a string in a string
    • mb_strstr Finds first occurrence of a string within another
    • mb_strtolower Make a string lowercase
    • mb_strtoupper Make a string uppercase
    • mb_strwidth Return width of string
    • mb_substr_count Count the number of substring occurrences
    • mb_substr Get part of string
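A brief comparison of a few of these with their byte-oriented counterparts. This is only a sketch: it assumes mbstring is loaded and the source file is saved as UTF-8, and it passes the encoding explicitly instead of relying on the configured default.

```php
<?php
// Assumes the mbstring extension is loaded and this file is saved as UTF-8.
$s = "Ärger übt"; // 9 characters, 11 bytes

var_dump(strlen($s));                     // int(11): bytes
var_dump(mb_strlen($s, 'UTF-8'));         // int(9): characters

var_dump(strpos($s, "ü"));                // int(7): byte offset
var_dump(mb_strpos($s, "ü", 0, 'UTF-8')); // int(6): character offset

var_dump(mb_substr($s, 0, 1, 'UTF-8') === "Ä");      // bool(true)
var_dump(mb_strtoupper($s, 'UTF-8') === "ÄRGER ÜBT"); // bool(true)

// mb_strcut() takes byte positions but is documented to cut on character
// boundaries, so the result should remain valid UTF-8.
$cut = mb_strcut($s, 0, 3, 'UTF-8');
var_dump(mb_check_encoding($cut, 'UTF-8')); // bool(true)
```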

    And iconv provides a bare minimum of string functions:

    Lastly Intl has a lot of extra and powerful Unicode features (but no regular expressions) as part of i18n. Some features overlap with other string functions. With respect to string functions these are: