Search code examples
phpencodingcharacter-encodingnormalize

Issues replacing special characters in PHP string


I'm trying to replace the special characters in a PHP string with normal characters (as in replace ó with o and á with a). I tried using the PHP Normalizer::normalize function as in the following code:

if (!Normalizer::isNormalized($word, Normalizer::FORM_C))
{
    echo "original: ".$word;
    $word = Normalizer::normalize($word, Normalizer::FORM_C);

    echo "\tnormalized: ".$word."<br />";
    exit; // see if it worked without having to go through every file
}

However, Normalizer::normalize returned null and the output from that code was:

original: adiós normalized:

Since this method didn't seem to be working, I went and found a function that was supposed to remove special characters. Here is the function:

function normalize ($string) {
    $table = array(
        'Š'=>'S', 'š'=>'s', 'Đ'=>'Dj', 'đ'=>'dj', 'Ž'=>'Z', 'ž'=>'z', 'Č'=>'C', 'č'=>'c', 'Ć'=>'C', 'ć'=>'c',
        'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
        'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
        'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
        'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
        'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
        'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b',
        'ÿ'=>'y', 'Ŕ'=>'R', 'ŕ'=>'r',
    );

    return strtr($string, $table);
}

This code had no noticeable effect, however, and returned the same string that was passed in.

I'm obtaining my strings from *.txt files in Windows 7. I've never been very good at encodings, and would appreciate any help on this issue.


Solution

  • I copied and pasted your code into my editor and something interesting happened. Instead of getting adios I was getting adjiós. Notice the j in the middle after the d. This was coming from the 'đ'=>'dj', in the first line of the table map. Apparently, my editor changed the đ to a regular d, and then it wouldn't convert the ó. I removed this key/value pair and suddenly it worked for me. Are you sure all of your keys are correct in your editor (Does you editor accept alternative character sets?) Here is my test file (with the đ removed:

    <html>
    <head>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
    </head>
    <body>
    <?php
    
    function normalize ($string) {
        $table = array(
            'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj', 'Ž'=>'Z', 'ž'=>'z', 'C'=>'C', 'c'=>'c', 'C'=>'C', 'c'=>'c',
            'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
            'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
            'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
            'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
            'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
            'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b',
            'ÿ'=>'y', 'R'=>'R', 'r'=>'r',
        );
    
        return strtr($string, $table);
    }
    
    $word = 'adiós';
    $length = strlen($word);
    
    echo 'original: '. $word;
    echo '<br />';
    echo 'normalized: '. normalize($word); 
    echo '<br />';
    echo 'loop: ';
    
    for($i = 0; $i < $length; $i++) {
        echo normalize($word[$i]);
    }
    
    ?>
    
    </body>
    </html>
    

    When I loop through each character with the 'd' => 'dj' in the array map then I correctly get adjios