Tags: php, encoding, utf-8, iconv

How can I detect a malformed UTF-8 string in PHP?


The iconv function sometimes gives me an error:

Notice: iconv() [function.iconv]: Detected an incomplete multibyte character in input string in [...]

Is there a way to detect illegal characters in a UTF-8 string before sending the data to iconv()?


Solution

  • First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.

    You can use the UTF-8 validity check built into preg_match [PHP Manual], available since PHP 4.3.5. With the u modifier, it returns 0 or false¹ (with no additional information²) when the subject string is not valid UTF-8:

    $validUTF8 = (bool) preg_match('//u', $string);
    

    Another possibility is mb_check_encoding [PHP Manual]:

    $validUTF8 = mb_check_encoding($string, 'UTF-8');
    

    Another function you can use is mb_detect_encoding [PHP Manual]:

    $validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));
    

    It's important to set the strict parameter to true.
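
    To see how the three checks behave side by side, here is a minimal sketch. The helper name utf8Checks() and the sample byte strings are assumptions made for illustration, and the mbstring extension is assumed to be loaded:

    ```php
    <?php
    // Sketch: run the three validity checks from above on one string.
    // utf8Checks() is a hypothetical helper name, not a PHP built-in.
    function utf8Checks(string $string): array
    {
        return [
            // preg_match gives 1 for valid input, 0 or false otherwise
            'preg_match'         => (bool) preg_match('//u', $string),
            'mb_check_encoding'  => mb_check_encoding($string, 'UTF-8'),
            // strict = true: return false instead of a best guess
            'mb_detect_encoding' => false !== mb_detect_encoding($string, 'UTF-8', true),
        ];
    }

    $samples = [
        'UTF-8 bytes'   => "caf\xC3\xA9", // e-acute encoded as two UTF-8 bytes
        'Latin-1 bytes' => "caf\xE9",     // e-acute as a single ISO-8859-1 byte
    ];

    foreach ($samples as $label => $bytes) {
        echo $label, ': ', json_encode(utf8Checks($bytes)), PHP_EOL;
    }
    ```

    On a current PHP version all three checks should agree for these two samples: the first string passes, the second does not.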

    Additionally, iconv [PHP Manual] lets you transliterate or drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it raises a notice; this behavior cannot be turned off.)

    echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
    echo 'IGNORE   : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;
    

    You can suppress the notice with the @ operator and compare the length of the returned string with the length of the original:

    strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));
    

    Check the examples on the iconv manual page as well.
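
    The length comparison can also be wrapped in a small function. This is only a sketch: survivesIconv() is a hypothetical name, and the result for //IGNORE depends on the iconv implementation PHP was built against (some builds return false on invalid input instead of a shortened string, and some may not accept the //IGNORE suffix at all), so the false case is handled explicitly:

    ```php
    <?php
    // Sketch: the length comparison as a reusable check.
    // survivesIconv() is a hypothetical helper name, not part of PHP.
    function survivesIconv(string $string): bool
    {
        // @ suppresses the notice iconv raises on invalid input.
        $converted = @iconv('UTF-8', 'UTF-8//IGNORE', $string);

        // Some iconv builds return false rather than dropping bytes.
        if ($converted === false) {
            return false;
        }

        // If bytes were dropped, the input contained invalid sequences.
        return strlen($converted) === strlen($string);
    }

    var_dump(survivesIconv("caf\xE9")); // lone Latin-1 byte: bool(false)
    ```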


    Remarks:

    ¹ preg_match() empty return value:

    • 0 up to and including PHP 5.3.3
    • false since PHP 5.3.4

    (Before PHP 4.3.5, that is up to 4.3.4, the //u test is not useful: it returns 1 for the subject string "\x80", which is not a complete UTF-8 sequence but at best a lone continuation byte; ref)

    ² with no additional information:

    The return value itself carries no additional information, and preg_match() does not emit a diagnostic message.

    As outlined earlier in the comments, more information can be obtained: when the failure is a match error rather than a plain no-match, one of the PREG_*_ERROR codes is set.

    This works by calling preg_last_error() (PHP >= 5.2) after preg_match() and comparing the returned integer against PREG_BAD_UTF8_ERROR to identify that the subject string is not UTF-8.

    For the diagnostic message, use preg_last_error_msg() (PHP >= 8). It returns the string "Malformed UTF-8 characters, possibly incorrectly encoded" (without the quotes) when the last error is PREG_BAD_UTF8_ERROR. (same ref)
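
    Putting remarks ¹ and ² together, a minimal sketch of a check that also reports why a string was rejected could look like this. describeUtf8() is a hypothetical helper name, and the fallback label for PHP versions before 8 is an assumption made for illustration:

    ```php
    <?php
    // Sketch: tell "not UTF-8" apart from other PCRE failures.
    // describeUtf8() is a hypothetical helper name, not a PHP built-in.
    function describeUtf8(string $string): string
    {
        // The empty pattern matches any valid UTF-8 subject.
        if (preg_match('//u', $string) === 1) {
            return 'valid';
        }

        if (preg_last_error() === PREG_BAD_UTF8_ERROR) {
            // preg_last_error_msg() exists only since PHP 8.
            return function_exists('preg_last_error_msg')
                ? preg_last_error_msg()
                : 'PREG_BAD_UTF8_ERROR';
        }

        return 'other PCRE error: ' . preg_last_error();
    }

    echo describeUtf8("abc"), PHP_EOL;  // valid
    echo describeUtf8("\x80"), PHP_EOL; // lone continuation byte, rejected
    ```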