Tags: php, regex, search, special-characters

PHP regex: match possible accented characters


I found a lot of questions about this, but none of them helped with my specific problem. The situation: I want to search for a string such as "blablebli" and match every accented variation of it ("blablebli", "blábleblí", "blâblèbli", etc.) in a text.

I already made a workaround to do the opposite (find the unaccented form of a word that I typed with accents). But I can't figure out a way to implement what I want.

Here is my working code (only the relevant part; it runs inside a foreach, so this shows the search for a single word):

$word="something";
$word = preg_quote(trim($word), '/'); // Escape regex metacharacters, including the "/" delimiter
$word2 = $this->removeAccents($word); // Removed all accents
if(!empty($word)) {
    $sentence = "/(".$word.")|(".$word2.")/ui"; // Now I'm checking with and without accents.
    if (preg_match($sentence, $content)){
        echo "found";
    }
}

And my removeAccents() function (I'm not sure I covered every possible accent with that preg_replace(). So far it's working; I would appreciate it if someone could check whether I'm missing anything):

function removeAccents($string)
{
    return preg_replace('/[\`\~\']/', '', iconv('UTF-8', 'ASCII//TRANSLIT', $string));
}

What I'm trying to avoid:

  • I know I could go through my $word and replace every a with [aàáãâä], and do the same for the other letters, but I don't know... it seems a little overkill.
  • And sure, I could use my own removeAccents() function in my if statement to check the $content without accents, something like:

    if (preg_match($sentence, $content) || preg_match($sentence, removeAccents($content)))
    

But my problem with that second option is that I want to highlight the word found after the match, so I can't change my $content.

Is there any way to improve my preg_match() to include possible accented characters? Or should I use the first option above?


Solution

  • Thanks for the help, everyone, but I will end up using the first suggestion I made in my question. And thanks again @CasimiretHippolyte for your patience and for making me realize that it isn't as much overkill as I thought.

    Here is the final code I'm using (first the functions):

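    // Requires the intl extension (Normalizer class).
    function removeAccents($string)
    {
        // Decompose accented letters (NFKD), then strip the combining marks U+0300 to U+036F.
        return preg_replace('/[\x{0300}-\x{036f}]/u', '', Normalizer::normalize($string, Normalizer::FORM_KD));
    }
    
    function addAccents($string)
    {
        // Each base letter is replaced by a character class listing its accented variants.
        $array1 = array('a', 'c', 'e', 'i' , 'n', 'o', 'u', 'y');
        $array2 = array('[aàáâãäå]','[cçćĉċč]','[eèéêë]','[iìíîï]','[nñ]','[oòóôõö]','[uùúûü]','[yýÿ]');
    
        return str_replace($array1, $array2, strtolower($string));
    }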
    

    And:

    $word="something";
    $word = preg_quote(trim($word), '/'); // Escape regex metacharacters, including the "/" delimiter
    $word2 = $this->addAccents($this->removeAccents($word)); //check all possible accents
    if(!empty($word)) {
        $sentence = "/(".$word.")|(".$word2.")/ui"; // Now I'm checking my normal word and the possible variations of it.
        if (preg_match($sentence, $content)){
            echo "found";
        }
    }
    

    By the way, I'm covering all the accents used in my country (and some others). Check whether you need to extend the addAccents() function before using it.
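
Since the question mentions highlighting the match without changing $content, here is a minimal sketch of how the same character-class idea could be used with preg_replace(). The helper names accentPattern() and highlight(), and the <mark> wrapper, are my own assumptions and not part of the code above. It also assumes the search word is already unaccented (for example, passed through removeAccents() first) and that the text uses precomposed (NFC) characters.

```php
<?php
// Hypothetical helper: turns an unaccented word into an accent-tolerant pattern.
function accentPattern(string $word): string
{
    // Same character classes as addAccents(); extend them for your language.
    $map = [
        'a' => '[aàáâãäå]',
        'c' => '[cçćĉċč]',
        'e' => '[eèéêë]',
        'i' => '[iìíîï]',
        'n' => '[nñ]',
        'o' => '[oòóôõö]',
        'u' => '[uùúûü]',
        'y' => '[yýÿ]',
    ];

    // Quote first, so regex metacharacters in the word stay literal.
    $quoted = preg_quote(strtolower(trim($word)), '/');

    // strtr() makes a single pass and never rescans text it has already replaced.
    return '/' . strtr($quoted, $map) . '/ui';
}

// Hypothetical helper: wraps every match in <mark>, leaving the rest of $content untouched.
function highlight(string $word, string $content): string
{
    if (trim($word) === '') {
        return $content; // An empty pattern would match at every position.
    }

    // $0 is the text exactly as it appears in $content, accents included.
    return preg_replace(accentPattern($word), '<mark>$0</mark>', $content) ?? $content;
}

echo highlight('blablebli', 'Texto com blábleblí aqui.');
// Texto com <mark>blábleblí</mark> aqui.
```

Because the replacement uses $0, the highlighted text keeps whatever accents the original content had, which is the reason for matching against the unmodified $content instead of a stripped copy.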