The string functions listed in the PHP documentation work at the byte level. That is fine for SBCS strings, but not for MBCS strings. Luckily, one well-known encoding, UTF-8, is backward compatible with 7-bit US-ASCII.
Since PHP 5.6 the default encoding has been UTF-8, but its string functions have not changed. The well-known alternatives are iconv, Multibyte String and Intl. The PCRE functions can also be MBCS compliant when compiled in the right way.
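To make the byte versus character distinction concrete, here is a minimal sketch that counts the same string four ways. It assumes the mbstring and iconv extensions are loaded, that PCRE was built with Unicode support (needed for the /u modifier), and PHP 7 or later for the "\u{...}" escape. The sample text is made up for illustration.

```php
<?php
// "naïve café": 10 characters, 12 bytes in UTF-8 (ï and é take 2 bytes each).
$text = "na\u{00EF}ve caf\u{00E9}";

$bytes      = strlen($text);                   // counts bytes
$mbChars    = mb_strlen($text, 'UTF-8');       // counts characters (Multibyte String)
$iconvChars = iconv_strlen($text, 'UTF-8');    // counts characters (iconv)
$pcreChars  = preg_match_all('/./us', $text);  // counts characters (PCRE in UTF-8 mode)

printf("%d bytes, %d / %d / %d characters\n", $bytes, $mbChars, $iconvChars, $pcreChars);
// prints "12 bytes, 10 / 10 / 10 characters"
```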
When aging SBCS code needs to be made VMBCS (UTF-8) compliant, the standard PHP byte string functions need to be rewritten to be MBCS safe. Although the most basic functions (like strpos()) have an mb_* variant (like mb_strpos()), most of PHP's string functions have no mb_ counterpart. For continued use they have to be rewritten.
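As a rough illustration of both cases, the sketch below first compares strpos() with mb_strpos(), then rewrites one function by hand on top of the mb_* primitives. The helper name utf8_ucfirst() is hypothetical, not part of PHP, and the code assumes mbstring is loaded and PHP 7 or later.

```php
<?php
// Hypothetical helper: a character-aware ucfirst() built from mb_* functions.
function utf8_ucfirst(string $s): string
{
    $first = mb_substr($s, 0, 1, 'UTF-8');     // first character, not first byte
    $rest  = mb_substr($s, 1, null, 'UTF-8');  // everything after it
    return mb_strtoupper($first, 'UTF-8') . $rest;
}

$text = "\u{00E9}t\u{00E9} indien";            // "été indien"

// Same needle, two different answers: byte offset versus character offset.
$bytePos = strpos($text, 'indien');                 // 6
$charPos = mb_strpos($text, 'indien', 0, 'UTF-8');  // 4

// ucfirst() only inspects the first byte, so it cannot uppercase "é";
// the helper works on whole characters instead.
$capitalized = utf8_ucfirst($text);            // "Été indien"
```

The two offsets must never be mixed: a position from strpos() belongs with substr(), a position from mb_strpos() belongs with mb_substr().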
In the first stage, one needs to determine which SBCS string functions will work despite their byte-oriented nature. Some have already been identified on SO; what I'm looking for now is a comprehensive list of functions that will work with UTF-8, either as is or when used with caution, for example with US-ASCII-only parameters. To clarify, the question is not about byte string functions like chr() or crc32(); it's about getting a list of functions like:
- count_chars() counts bytes, ...
- ltrim() will work as long as parameters are US-ASCII, ...
- str_repeat() will work with MBCS strings, ...
Would anybody know such a list?
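The three behaviours named above can be checked with a short sketch. The sample strings are invented, and the code assumes mbstring (for mb_check_encoding()) and PHP 7 or later.

```php
<?php
// "ééa" is 3 characters but 5 bytes: C3 A9 C3 A9 61.
$s = "\u{00E9}\u{00E9}a";

// count_chars() tallies bytes, not characters (mode 1: byte value => count).
$freq = count_chars($s, 1);              // [0x61 => 1, 0xA9 => 2, 0xC3 => 2]

// ltrim() with a US-ASCII character list is safe.
$safe = ltrim("--" . $s, "-");           // "ééa"

// A multi-byte character in the list is treated as separate bytes.
// "à" is C3 A0, so the lead byte C3 of "é" is stripped and A9 is orphaned.
$broken = ltrim($s, "\u{00E0}");
$valid  = mb_check_encoding($broken, 'UTF-8');   // false

// str_repeat() concatenates whole copies, so byte sequences stay intact.
$repeated = str_repeat("\u{00E9}", 3);   // "ééé"
```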
Assuming the default encoding of PHP is set to UTF-8, these string functions will work:
- echo: Output one or more strings
- html_entity_decode: Convert all HTML entities to their applicable characters
- htmlentities: Convert all applicable characters to HTML entities | better use
- htmlspecialchars_decode: Convert special HTML entities back to characters
- htmlspecialchars: Convert special characters to HTML entities
- implode: Join array elements with a string
- join: Alias of implode
- nl2br: Inserts HTML line breaks before all newlines in a string
- print: Output a string
- quotemeta: Quote meta characters
- str_repeat: Repeat a string
- strip_tags: Strip HTML and PHP tags from a string
- stripcslashes: Un-quote string quoted with addcslashes
- stripslashes: Un-quotes a quoted string
Unfortunately all other string functions do not work with UTF-8. Obstacles:
In some cases functions can work as expected when parameters are US-ASCII and lengths are byte lengths.
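A small sketch of what that means in practice, assuming PHP 7.1 or later (for the short list syntax) and an invented key=value line. An ASCII delimiter can never match inside a multi-byte UTF-8 sequence, because every byte of such a sequence is 0x80 or higher.

```php
<?php
$line = "pr\u{00E9}nom=Zo\u{00EB}";      // "prénom=Zoë"

// ASCII delimiter: explode() cannot split inside a multi-byte sequence.
[$key, $value] = explode('=', $line, 2); // "prénom", "Zoë"

// A byte offset from strpos() fed into substr(): both count bytes,
// so the pair stays consistent even though the text is UTF-8.
$pos       = strpos($line, '=');         // 7 (the character offset would be 6)
$viaSubstr = substr($line, $pos + 1);    // "Zoë"

// The caveat: lengths are byte lengths. "Zoë" is 3 characters but 4 bytes,
// so padding to 5 adds only one dot.
$padded = str_pad($value, 5, '.');       // "Zoë."
```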
Binary string functions are still useful:
- bin2hex: Convert binary data into hexadecimal representation
- chr: Return a specific character (=byte)
- convert_uudecode: Decode a uuencoded string
- convert_uuencode: Uuencode a string
- crc32: Calculates the crc32 polynomial of a string
- crypt: One-way string hashing
- hex2bin: Decodes a hexadecimally encoded binary string
- md5_file: Calculates the md5 hash of a given file
- md5: Calculate the md5 hash of a string
- ord: Return ASCII value of character (=byte)
- sha1_file: Calculate the sha1 hash of a file
- sha1: Calculate the sha1 hash of a string
Configuration functions do not apply:
- get_html_translation_table: Returns the translation table used by htmlspecialchars and htmlentities
- localeconv: Get numeric formatting information
- nl_langinfo: Query language and locale information
- setlocale: Set locale information
Regular expression functions and encoding and transcoding functions are not considered.
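Coming back to the binary string functions listed earlier, the sketch below shows how they see a UTF-8 character as a plain run of bytes. The mb_ord() and mb_chr() calls are added only for contrast; they are not in the lists here and assume PHP 7.2 or later with mbstring loaded.

```php
<?php
$euro = "\u{20AC}";                      // "€", encoded in UTF-8 as E2 82 AC

$hex       = bin2hex($euro);             // "e282ac": one hex pair per byte
$firstByte = ord($euro);                 // 0xE2: only the first byte is read
$rebuilt   = chr(0xE2) . chr(0x82) . chr(0xAC);  // byte by byte, equals "€"
$decoded   = hex2bin('e282ac');          // also equals "€"

// Character-level counterparts, for contrast with ord() and chr().
$codePoint = mb_ord($euro, 'UTF-8');     // 0x20AC
$fromPoint = mb_chr(0x20AC, 'UTF-8');    // "€"
```

Hash functions such as md5() and sha1() behave the same way: they digest the bytes, so the result depends on the encoding of the input, not on the characters it represents.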
In quite a few cases, Multibyte String offers a UTF-8 variant:
- mb_convert_case: Perform case folding on a string
- mb_parse_str: Parse GET/POST/COOKIE data and set global variable
- mb_split: Split multibyte string using regular expression
- mb_strcut: Get part of string
- mb_strimwidth: Get truncated string with specified width
- mb_stripos: Finds position of first occurrence of a string within another, case insensitive
- mb_stristr: Finds first occurrence of a string within another, case insensitive
- mb_strlen: Get string length
- mb_strpos: Find position of first occurrence of string in a string
- mb_strrchr: Finds the last occurrence of a character in a string within another
- mb_strrichr: Finds the last occurrence of a character in a string within another, case insensitive
- mb_strripos: Finds position of last occurrence of a string within another, case insensitive
- mb_strrpos: Find position of last occurrence of a string in a string
- mb_strstr: Finds first occurrence of a string within another
- mb_strtolower: Make a string lowercase
- mb_strtoupper: Make a string uppercase
- mb_strwidth: Return width of string
- mb_substr_count: Count the number of substring occurrences
- mb_substr: Get part of string
And iconv provides a bare minimum of string functions:
- iconv_strlen: Returns the character count of string
- iconv_strpos: Finds position of first occurrence of a needle within a haystack
- iconv_strrpos: Finds the last occurrence of a needle within a haystack
- iconv_substr: Cut out part of a string
Lastly, Intl has a lot of extra and powerful Unicode features (but no regular expressions) as part of i18n. Some features overlap with other string functions. With respect to string functions these are: