I'm having problems comparing two strings that contain accents. This is my case:
The first string is: Master
The second string is: Máster Diseño Producción
I need to remove the word Máster from the second string, because it is contained in the first string.
I have created a function to clean each string:
function sanear_string($cadena)
{
    $cadena = trim($cadena);
    $cadena = str_replace(
        array('á', 'à', 'ä', 'â', 'ª', 'Á', 'À', 'Â', 'Ä'),
        array('a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A'),
        $cadena
    );
    $cadena = str_replace(
        array('é', 'è', 'ë', 'ê', 'É', 'È', 'Ê', 'Ë'),
        array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E'),
        $cadena
    );
    $cadena = str_replace(
        array('í', 'ì', 'ï', 'î', 'Í', 'Ì', 'Ï', 'Î'),
        array('i', 'i', 'i', 'i', 'I', 'I', 'I', 'I'),
        $cadena
    );
    $cadena = str_replace(
        array('ó', 'ò', 'ö', 'ô', 'Ó', 'Ò', 'Ö', 'Ô'),
        array('o', 'o', 'o', 'o', 'O', 'O', 'O', 'O'),
        $cadena
    );
    $cadena = str_replace(
        array('ú', 'ù', 'ü', 'û', 'Ú', 'Ù', 'Û', 'Ü'),
        array('u', 'u', 'u', 'u', 'U', 'U', 'U', 'U'),
        $cadena
    );
    $cadena = str_replace(
        array('ñ', 'Ñ', 'ç', 'Ç'),
        array('n', 'N', 'c', 'C'),
        $cadena
    );
    // This part removes any strange character
    $cadena = str_replace(
        array("\\", "¨", "º", "-", "~",
            "#", "@", "|", "!", "\"",
            "·", "$", "%", "&", "/",
            "(", ")", "?", "'", "¡",
            "¿", "[", "^", "`", "]",
            "+", "}", "{", "¨", "´",
            ">", "<", ";", ",", ":",
            ".", " "),
        '',
        $cadena
    );
    return $cadena;
}
This solves the accent problem. Now I can use strpos to compare both strings: if the result is > 0, I know the word is contained. But I still need some more help. Thanks in advance.
As usual when dealing with charset problems, you need to be extra careful about the character counts between multibyte strings and plain ASCII strings.
Your biggest problem here is that you remove some predefined characters from the cleaned string, which breaks the character-count correspondence between the sanitized string and the original and makes the removal much harder.
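To make the point concrete, here is a small sketch (not part of your code) that compares byte counts with character counts and shows how stripping characters shifts positions. It assumes PHP 7+ with the mbstring extension; the variable names and the hand-folded `$sanitized` string are only for illustration.

```php
<?php
// "\u{...}" escapes (PHP 7+) guarantee UTF-8 bytes regardless of how this file is saved.
$original  = "M\u{00E1}ster Dise\u{00F1}o";   // "Máster Diseño"
$sanitized = "Master Diseno";                 // accents folded by hand for this demo

// Each accented letter takes two bytes in UTF-8, so bytes != characters.
printf("bytes: %d\n", strlen($original));                         // bytes: 15
printf("chars: %d\n", mb_strlen($original, 'UTF-8'));             // chars: 13
printf("chars (sanitized): %d\n", mb_strlen($sanitized, 'UTF-8')); // chars (sanitized): 13

// Folding accents keeps the character count, so positions still line up.
// Removing characters (here the space) does not: the offsets drift apart.
$stripped      = str_replace(' ', '', $sanitized);                  // "MasterDiseno"
$posInStripped = mb_strpos($stripped, 'Diseno', 0, 'UTF-8');        // 6
$posInOriginal = mb_strpos($original, "Dise\u{00F1}o", 0, 'UTF-8'); // 7
printf("stripped: %d, original: %d\n", $posInStripped, $posInOriginal);
```

As long as the sanitized string has exactly as many characters as the original, a character offset found in one can be reused in the other.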
I'll use a modified version of your sanitizing function:
function sanitize($cadena) {
    $cadena = str_replace(
        array('á', 'à', 'ä', 'â', 'ª', 'Á', 'À', 'Â', 'Ä'),
        array('a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A'),
        $cadena
    );
    $cadena = str_replace(
        array('é', 'è', 'ë', 'ê', 'É', 'È', 'Ê', 'Ë'),
        array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E'),
        $cadena
    );
    $cadena = str_replace(
        array('í', 'ì', 'ï', 'î', 'Í', 'Ì', 'Ï', 'Î'),
        array('i', 'i', 'i', 'i', 'I', 'I', 'I', 'I'),
        $cadena
    );
    $cadena = str_replace(
        array('ó', 'ò', 'ö', 'ô', 'Ó', 'Ò', 'Ö', 'Ô'),
        array('o', 'o', 'o', 'o', 'O', 'O', 'O', 'O'),
        $cadena
    );
    $cadena = str_replace(
        array('ú', 'ù', 'ü', 'û', 'Ú', 'Ù', 'Û', 'Ü'),
        array('u', 'u', 'u', 'u', 'U', 'U', 'U', 'U'),
        $cadena
    );
    $cadena = str_replace(
        array('ñ', 'Ñ', 'ç', 'Ç'),
        array('n', 'N', 'c', 'C'),
        $cadena
    );
    return strtolower($cadena);
}
The remove_word function follows:
function remove_word($haystack, $needle) {
    // Sanitize input strings
    $haystack_san = sanitize($haystack);
    $needle_san = sanitize($needle);
    // Check for character loss
    if (mb_strlen($haystack_san, 'UTF-8') != mb_strlen($haystack, 'UTF-8') || mb_strlen($needle_san, 'UTF-8') != mb_strlen($needle, 'UTF-8')) {
        // Here for debugging purposes. You may want to drop it in production.
        echo "Lost some chars on the way. Aborting.\n";
        echo " haystack: $haystack (".mb_strlen($haystack, "UTF-8").")\n";
        echo " haystack_san: $haystack_san (".mb_strlen($haystack_san, "UTF-8").")\n";
        echo " needle: $needle (".mb_strlen($needle, "UTF-8").")\n";
        echo " needle_san: $needle_san (".mb_strlen($needle_san, "UTF-8").")\n";
        return;
    }
    // Check if $needle is found in $haystack. mb_strpos returns a character
    // offset (strpos returns a byte offset), which is what mb_substr expects.
    if (($pos = mb_strpos($haystack_san, $needle_san, 0, 'UTF-8')) !== false) {
        // Get the string before the word
        $new = mb_substr($haystack, 0, $pos, 'UTF-8');
        // If applicable, get the string after
        if (mb_strlen($haystack, 'UTF-8') - $pos - mb_strlen($needle, 'UTF-8') > 0)
            $new .= mb_substr($haystack, $pos + mb_strlen($needle, 'UTF-8'), NULL, 'UTF-8');
        // Return it
        return $new;
    }
    // If the word wasn't found, return $haystack as-is
    return $haystack;
}
echo remove_word("Hola, Máster Diseño Producción", "Master");
// "Hola,  Diseño Producción" (the spaces on both sides of the removed word are kept)
Note that:
You need to use the mb_* functions to handle multi-byte characters.
Only the first occurrence is removed (call remove_word in a loop until the string no longer changes if you want to replace all occurrences).
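The "loop until the string no longer changes" idea could look roughly like the sketch below. The name remove_all_occurrences and the `$removeOnce` callback are hypothetical, introduced only for this example; the stand-in remover is a simplified ASCII-only function so that the snippet runs on its own (PHP 7+ assumed).

```php
<?php
// Hypothetical helper: keeps applying a "remove first occurrence" callback
// until the string stops changing.
function remove_all_occurrences(string $haystack, string $needle, callable $removeOnce): string
{
    do {
        $previous = $haystack;
        $result = $removeOnce($haystack, $needle);
        // remove_word returns nothing when it aborts; keep the last good value.
        if (!is_string($result)) {
            return $previous;
        }
        $haystack = $result;
    } while ($haystack !== $previous);

    return $haystack;
}

// Stand-in remover for this demo only: ASCII, case-insensitive, first match.
$removeOnce = function (string $haystack, string $needle): string {
    $pos = stripos($haystack, $needle);
    if ($pos === false) {
        return $haystack;
    }
    return substr($haystack, 0, $pos) . substr($haystack, $pos + strlen($needle));
};

echo remove_all_occurrences("Master Diseno Master", "master", $removeOnce);
// prints " Diseno " (the surrounding spaces are left in place)
```

With the functions from this answer loaded, passing 'remove_word' as the third argument should give the accent-insensitive behaviour.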