php unicode utf-8 utf8-decode

Unicode unknown "�" character detection in PHP


Is there any way in PHP to detect the following character: �?

I'm currently fixing a number of UTF-8 encoding issues with a few different algorithms and need to be able to detect whether � is present in a string. How do I do so with strpos?

Simply pasting the character into my codebase does not seem to work.

if (strpos($names['decode'], '?') !== false || strpos($names['decode'], '�') !== false)

Solution

  • Converting a UTF-8 string to UTF-8 with iconv() and the //IGNORE flag produces a result in which invalid UTF-8 byte sequences are dropped.

    Therefore, you can detect a broken character by comparing the length of the string before and after the iconv() operation. If the lengths differ, the string contained a broken character.

    Test case (make sure you save the file as UTF-8):

    <?php
    
    header("Content-type: text/html; charset=utf-8");
    
    $teststring = "Düsseldorf";
    
    // Deliberately create a broken string
    // by encoding the original string as ISO-8859-1.
    // Note: utf8_decode() is deprecated as of PHP 8.2.
    $teststring_broken = utf8_decode($teststring);
    
    echo "Broken string: ".$teststring_broken ;
    
    echo "<br>";
    
    $teststring_converted = iconv("UTF-8", "UTF-8//IGNORE", $teststring_broken );
    
    echo $teststring_converted;
    
    echo "<br>";
    
    if (strlen($teststring_converted) !== strlen($teststring_broken)) {
        echo "The string contained an invalid character";
    }
    

    In theory, you could drop //IGNORE and simply test for a failed iconv() call (it returns false on failure), but there might be other reasons for an iconv() call to fail than just invalid characters; I don't know. I would use the comparison method.
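
As a cross-check that does not go through iconv(), here is a minimal sketch of two related tests. It is a swapped-in technique, not the comparison method from this answer: it relies on preg_match() returning false when a subject is not valid UTF-8 under the u modifier, and on strpos() with the UTF-8 byte sequence of U+FFFD. The function names contains_invalid_utf8 and contains_replacement_char are hypothetical, chosen only for illustration.

```php
<?php

// Hypothetical helper (not from the original answer): reports whether
// $s contains byte sequences that are not valid UTF-8.
function contains_invalid_utf8(string $s): bool
{
    // With the "u" modifier, PCRE validates the subject as UTF-8;
    // preg_match() returns false for a malformed subject and 1 when
    // the empty pattern matches a valid one.
    return preg_match('//u', $s) !== 1;
}

// Hypothetical helper: reports whether $s contains a literal U+FFFD
// replacement character, which is the bytes EF BF BD in UTF-8.
function contains_replacement_char(string $s): bool
{
    return strpos($s, "\xEF\xBF\xBD") !== false;
}

$valid  = "D\xC3\xBCsseldorf"; // "Düsseldorf" encoded as UTF-8
$broken = "D\xFCsseldorf";     // same text in ISO-8859-1, invalid as UTF-8

var_dump(contains_invalid_utf8($valid));
var_dump(contains_invalid_utf8($broken));
var_dump(contains_replacement_char("D\xEF\xBF\xBDsseldorf"));
```

The two checks answer different questions. A string with invalid bytes is usually only displayed as � by the browser or editor, so strpos() for the character finds nothing in it; the literal search only matches when an earlier conversion step has already written U+FFFD into the data. Writing the character as an escaped byte sequence also avoids depending on the encoding in which the source file is saved.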