Can PHP detect 4-byte encoded UTF-8 characters?


I am using utf8-charset tables on a MySQL 5.1 server, which does not support the utf8mb4 encoding. When I insert 4-byte encoded UTF-8 characters such as "𡃁","𨋢","𠵱","𥄫","𠽌","唧","𠱁", the insert either raises an error or drops the text that follows the character.

How can I programmatically detect 4-byte encoded utf8 characters in PHP and replace them?


Solution

  • The following function uses a regular expression to replace 4-byte UTF-8 characters:

    function replace4byte($string, $replacement = '') {
        return preg_replace('%(?:
              \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )%xs', $replacement, $string);    
    }
    
    // Prints string(1) "d" and string(2) "dd": the 4-byte character is removed.
    var_dump(replace4byte('d'), replace4byte('d𡃁d'));
    

    This doesn't rely on the /u modifier, so it works even if your PCRE build was compiled without UTF-8 support. If you do have that support, deceze's preg_replace_callback approach is neater.

    (Regex adapted from Ensuring valid utf-8 in PHP)
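
The question also asks about detection, not only replacement. A minimal sketch of how the same byte-level pattern could be reused for that is below. The names FOUR_BYTE_UTF8, has4byte, find4byte and replace4byteUnicode are hypothetical and are not part of the answer above. The last function is one possible /u-based variant, which assumes PCRE was built with UTF-8 support; it is not necessarily the code from deceze's answer.

```php
<?php
// Hypothetical constant: the same byte-level pattern used by replace4byte().
const FOUR_BYTE_UTF8 = '%(?:
      \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)%xs';

// Returns true if the string contains at least one 4-byte UTF-8 sequence.
function has4byte($string) {
    return preg_match(FOUR_BYTE_UTF8, $string) === 1;
}

// Returns every 4-byte UTF-8 sequence found in the string, in order.
function find4byte($string) {
    preg_match_all(FOUR_BYTE_UTF8, $string, $matches);
    return $matches[0];
}

// Assumption: PCRE has UTF-8 support, so the /u modifier is available.
// Matches code points above the Basic Multilingual Plane directly.
// preg_replace() returns null here if $string is not valid UTF-8.
function replace4byteUnicode($string, $replacement = '') {
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', $replacement, $string);
}
```

For example, has4byte('d𡃁d') is true while has4byte('d') is false, so a caller could check a value before inserting it and decide whether to reject it or pass it through replace4byte(). Note that "唧" from the question is U+5527, which encodes to 3 bytes, so none of these functions match it.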