
Comparing strings with accents in PHP


I'm having problems comparing two strings that contain accents. This is my case:

The first string is: Master
The second string is: Máster Diseño Producción

I need to remove the word Máster from the second string, because it is contained in the first string.

I have created a function to clean each string:

function sanear_string($cadena)
{
    $cadena = trim($cadena);

    $cadena = str_replace(
        array('á', 'à', 'ä', 'â', 'ª', 'Á', 'À', 'Â', 'Ä'),
        array('a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A'),
        $cadena
    );

    $cadena = str_replace(
        array('é', 'è', 'ë', 'ê', 'É', 'È', 'Ê', 'Ë'),
        array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E'),
        $cadena
    );

    $cadena = str_replace(
        array('í', 'ì', 'ï', 'î', 'Í', 'Ì', 'Ï', 'Î'),
        array('i', 'i', 'i', 'i', 'I', 'I', 'I', 'I'),
        $cadena
    );

    $cadena = str_replace(
        array('ó', 'ò', 'ö', 'ô', 'Ó', 'Ò', 'Ö', 'Ô'),
        array('o', 'o', 'o', 'o', 'O', 'O', 'O', 'O'),
        $cadena
    );

    $cadena = str_replace(
        array('ú', 'ù', 'ü', 'û', 'Ú', 'Ù', 'Û', 'Ü'),
        array('u', 'u', 'u', 'u', 'U', 'U', 'U', 'U'),
        $cadena
    );

    $cadena = str_replace(
        array('ñ', 'Ñ', 'ç', 'Ç'),
        array('n', 'N', 'c', 'C',),
        $cadena
    );

    // This part removes any strange characters
    $cadena = str_replace(
        array("\\", "¨", "º", "-", "~",
            "#", "@", "|", "!", "\"",
            "·", "$", "%", "&", "/",
            "(", ")", "?", "'", "¡",
            "¿", "[", "^", "`", "]",
            "+", "}", "{", "¨", "´",
            ">", "<", ";", ",", ":",
            ".", " "),
        '',
        $cadena
    );


    return $cadena;
}

This solves the accent problem. Now I can use strpos to compare both strings: if the result is not false (a match at the very start returns 0, so a plain > 0 check is not enough), I know the word is contained. But I still need some more help with the removal. Thanks in advance.


Solution

  • As usual when dealing with charset problems, you need to be extra careful about the character counts between multibyte strings and plain ASCII strings.

    Your biggest problem here is that you remove some pre-defined characters from the cleaned string, which breaks the character-count coherence between the sanitized string and the original and makes the removal much harder.
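
    To make the byte-versus-character distinction concrete, here is a small sketch. It assumes the mbstring extension is loaded and PHP 7 or later (for the "\u{...}" escapes); the sample strings are only illustrations.

    ```php
    <?php
    // "\u{00E1}" is á and "\u{00F1}" is ñ; each takes two bytes in UTF-8.
    $word     = "M\u{00E1}ster";                // "Máster"
    $sentence = "Dise\u{00F1}o M\u{00E1}ster";  // "Diseño Máster"

    // strlen() counts bytes, mb_strlen() counts characters.
    $bytes = strlen($word);              // 7
    $chars = mb_strlen($word, 'UTF-8');  // 6

    // The same gap shows up in offsets: strpos() returns a byte offset,
    // mb_strpos() returns a character offset.
    $byteOffset = strpos($sentence, $word);                 // 8
    $charOffset = mb_strpos($sentence, $word, 0, 'UTF-8');  // 7

    printf("%d bytes, %d chars, byte offset %d, char offset %d\n",
        $bytes, $chars, $byteOffset, $charOffset);
    ```

    Mixing a byte offset with a character-based function such as mb_substr is what cuts a string in the wrong place.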

    I'll use a modified version of your sanitizing function:

    function sanitize($cadena) {
        $cadena = str_replace(
            array('á', 'à', 'ä', 'â', 'ª', 'Á', 'À', 'Â', 'Ä'),
            array('a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A'),
            $cadena
        );
    
        $cadena = str_replace(
            array('é', 'è', 'ë', 'ê', 'É', 'È', 'Ê', 'Ë'),
            array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E'),
            $cadena
        );
    
        $cadena = str_replace(
            array('í', 'ì', 'ï', 'î', 'Í', 'Ì', 'Ï', 'Î'),
            array('i', 'i', 'i', 'i', 'I', 'I', 'I', 'I'),
            $cadena
        );
    
        $cadena = str_replace(
            array('ó', 'ò', 'ö', 'ô', 'Ó', 'Ò', 'Ö', 'Ô'),
            array('o', 'o', 'o', 'o', 'O', 'O', 'O', 'O'),
            $cadena
        );
    
        $cadena = str_replace(
            array('ú', 'ù', 'ü', 'û', 'Ú', 'Ù', 'Û', 'Ü'),
            array('u', 'u', 'u', 'u', 'U', 'U', 'U', 'U'),
            $cadena
        );
    
        $cadena = str_replace(
            array('ñ', 'Ñ', 'ç', 'Ç'),
            array('n', 'N', 'c', 'C',),
            $cadena
        );
    
    
        return strtolower($cadena);
    }
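
    As a side note, the same one-to-one replacement can probably be written more compactly with a single strtr() call. The sketch below is only an illustration: sanitize_map is a hypothetical name, and it assumes the source file is saved as UTF-8 with precomposed characters.

    ```php
    <?php
    // Hypothetical compact variant of sanitize(): each accented character maps
    // to exactly one ASCII letter, so the character count is preserved.
    function sanitize_map(string $text): string {
        static $map = [
            'á' => 'a', 'à' => 'a', 'ä' => 'a', 'â' => 'a', 'ª' => 'a',
            'Á' => 'A', 'À' => 'A', 'Â' => 'A', 'Ä' => 'A',
            'é' => 'e', 'è' => 'e', 'ë' => 'e', 'ê' => 'e',
            'É' => 'E', 'È' => 'E', 'Ê' => 'E', 'Ë' => 'E',
            'í' => 'i', 'ì' => 'i', 'ï' => 'i', 'î' => 'i',
            'Í' => 'I', 'Ì' => 'I', 'Ï' => 'I', 'Î' => 'I',
            'ó' => 'o', 'ò' => 'o', 'ö' => 'o', 'ô' => 'o',
            'Ó' => 'O', 'Ò' => 'O', 'Ö' => 'O', 'Ô' => 'O',
            'ú' => 'u', 'ù' => 'u', 'ü' => 'u', 'û' => 'u',
            'Ú' => 'U', 'Ù' => 'U', 'Û' => 'U', 'Ü' => 'U',
            'ñ' => 'n', 'Ñ' => 'N', 'ç' => 'c', 'Ç' => 'C',
        ];
        // strtr() with an array replaces every key in a single pass.
        return strtolower(strtr($text, $map));
    }

    $original = 'Máster Diseño Producción';
    $clean    = sanitize_map($original);

    echo $clean, "\n"; // master diseno produccion
    // Both strings have the same number of characters.
    var_dump(mb_strlen($clean, 'UTF-8') === mb_strlen($original, 'UTF-8'));
    ```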
    

    The remove_word function follows:

    function remove_word($haystack , $needle) {
        // sanitize input strings
        $haystack_san = sanitize($haystack);
        $needle_san = sanitize($needle);
    
        // Check for character loss
        if (mb_strlen($haystack_san, 'UTF-8') != mb_strlen($haystack, 'UTF-8') || mb_strlen($needle_san, 'UTF-8') != mb_strlen($needle, 'UTF-8')) {
            // Here for debugging purposes. You may want to drop it in production.
            echo "Lost some chars on the way. Aborting.\n";
            echo "     haystack: $haystack (".mb_strlen($haystack, "UTF-8").")\n";
            echo " haystack_san: $haystack_san (".mb_strlen($haystack_san, "UTF-8").")\n";
            echo "       needle: $needle (".mb_strlen($needle, "UTF-8").")\n";
            echo "   needle_san: $needle_san (".mb_strlen($needle_san, "UTF-8").")\n";
            return;
        }
    
        // Check if $needle is found in $haystack (mb_strpos returns a character offset, not a byte offset)
        if (($pos = mb_strpos($haystack_san, $needle_san, 0, 'UTF-8')) !== false) {
            // Get the string before the word
            $new = mb_substr($haystack, 0, $pos, 'UTF-8');
            // If applicable, get the string after
            if (mb_strlen($haystack, 'UTF-8') - $pos - mb_strlen($needle, 'UTF-8') > 0)
                $new .= mb_substr($haystack, $pos + mb_strlen($needle, 'UTF-8'), NULL, 'UTF-8');
            // Return it
            return $new;
        }
    
        // If the word wasn't found, return $haystack as-is
        return $haystack;
    }
    
    echo remove_word("Hola, Máster Diseño Producción", "Master");
    // "Hola,  Diseño Producción"
    
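    The `!== false` comparison in remove_word matters: when the word sits at the very start of the string, the search returns 0, which a loose "greater than zero" test would treat as "not found". A minimal sketch with illustrative values:

    ```php
    <?php
    $haystack = 'master diseno produccion';
    $needle   = 'master';

    // Found at the very beginning, so the offset is 0 (not false).
    $pos = strpos($haystack, $needle);

    $looseCheck  = ($pos > 0);        // false, although the word is there
    $strictCheck = ($pos !== false);  // true

    var_dump($pos, $looseCheck, $strictCheck);
    ```
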

    Note that:

    • This assumes your strings are UTF-8
    • The code relies on the mb_* functions to handle multi-byte characters
    • This only removes the first occurrence of the word (you may call remove_word until the string no longer changes if you want to remove all occurrences)
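
    The "call it until the string no longer changes" idea from the last note could look roughly like this. remove_all is a hypothetical helper, and the closure is an ASCII-only stand-in for remove_word so that the sketch runs on its own; it assumes the callback always returns a string.

    ```php
    <?php
    // Hypothetical helper: applies a "remove first occurrence" callback
    // repeatedly until the string stops changing.
    function remove_all(callable $removeOnce, string $haystack, string $needle): string {
        do {
            $previous = $haystack;
            $haystack = $removeOnce($previous, $needle);
        } while ($haystack !== $previous);

        return $haystack;
    }

    // Stand-in for remove_word(): removes the first occurrence, ASCII only.
    $removeFirst = function (string $haystack, string $needle): string {
        $pos = strpos($haystack, $needle);
        if ($pos === false) {
            return $haystack; // nothing left to remove
        }
        return substr_replace($haystack, '', $pos, strlen($needle));
    };

    $result = remove_all($removeFirst, "Master A Master B", "Master");
    echo $result; // " A  B"
    ```

    In real use you would pass 'remove_word' as the callback instead of the stand-in closure.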