Tags: php, utf-8, character-encoding, strlen, non-ascii-characters

Odd behavior from mb_strlen when calling it through two functions


I often have to strip accents from strings, so I wrote a function, called accent(), to manage this more effectively. It was working well, but I recently ran into some characters that didn't get parsed correctly. This turned out to be an encoding issue (what else?) so I totally rewrote my code... and now I'm running into a new issue.

When I use the function directly, it seems to be working fine. However, when the function is called from within another function, it seems to break the code.

The second function, makesortname(), handles the creation of sort names. It does a bunch of stuff, then runs the result through accent() to strip any accents.

As an example, I'll take the name "Ekrem Ergün". Running it through makesortname() is supposed to return "ErgünEkrem" which then should become "ErgunEkrem" after using accent().
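To make the expected result concrete, here is a minimal sketch of the intended transformation. It assumes a small in-memory map standing in for the tbl_Accents lookup; `sketch_sortname()` and `$accentMap` are hypothetical names, and the dashnames() step and punctuation stripping are left out.

```php
<?php
// Hypothetical stand-in for the database lookup: accented character => plain character.
// "\u{00FC}" is ü encoded as UTF-8 (this escape syntax needs PHP 7 or later).
$accentMap = ["\u{00FC}" => 'u', "\u{00E9}" => 'e'];

// Swap "First Last" to "LastFirst", then replace accented characters.
function sketch_sortname(string $name, array $map): string
{
    $parts = explode(' ', $name, 2);
    $joined = (count($parts) === 2) ? $parts[1] . $parts[0] : $parts[0];
    return strtr($joined, $map);   // the array form replaces whole multibyte keys
}

echo sketch_sortname("Ekrem Erg\u{00FC}n", $accentMap), "\n";   // ErgunEkrem
```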

My accent() function uses mb_strlen() then runs each character in the string against a table to check for accents. If I print out each character to test it out, I'm noticing that mb_strlen is only reporting 5 characters instead of 10 and that 'ünEkre' is being treated as ONE character (which explains why the accent is not being stripped, as it's checking for that string instead of just 'ü').
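One possible explanation, offered as an assumption rather than a confirmed diagnosis: the string reaching accent() may be encoded as ISO-8859-1 rather than UTF-8. In ISO-8859-1, ü is the single byte 0xFC, which the original UTF-8 scheme treated as the lead byte of a six-byte sequence. That would be consistent with 'ünEkre' (six bytes) being read as one character. The sketch below checks for this case and converts before counting; `$latin1` and `$utf8` are illustrative names only.

```php
<?php
// "ErgünEkrem" as ISO-8859-1 bytes: ü is the single byte 0xFC.
$latin1 = "Erg\xFCnEkrem";

// A lone 0xFC byte is not valid UTF-8, so this check fails.
var_dump(mb_check_encoding($latin1, 'UTF-8'));   // bool(false)

// Convert to UTF-8 first (assumes the input really is ISO-8859-1).
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump(mb_check_encoding($utf8, 'UTF-8'));     // bool(true)
echo mb_strlen($utf8, 'UTF-8'), "\n";            // 10 characters
echo strlen($utf8), "\n";                        // 11 bytes, since ü takes two
```

Checking the encoding before calling mb_strlen() avoids depending on how a particular PHP version counts invalid byte sequences.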

The problem seems to be the 'utf8' argument I pass to mb_strlen(). Thing is, if I leave it out, the code doesn't always work, depending on the string. In this specific case, removing it only fixes the string length; the ü still doesn't get replaced (even if I also remove 'utf8' from mb_substr()).

Here's the code I'm using.

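function accent($term)
    {
    $orstr = $term;
    $str2 = $orstr;
    // Quote the encoding name; a bare utf8 is an undefined constant.
    $strlen = mb_strlen($orstr, 'utf8');
    for( $i = 0; $i < $strlen; $i++ )
        {
        $char = mb_substr($orstr, $i, 1, 'utf8');
        $noacc = '';    // reset so a missing row cannot leave it undefined

        // Look up the unaccented replacement for this character.
        // Note: the mysql_* functions are deprecated and were removed in PHP 7.
        $safechar = mysql_real_escape_string($char);
        $chkacc = mysql_db_query("Definitions","SELECT NoAcc_col FROM tbl_Accents WHERE Letr_col = '$safechar' ");
            while($row = mysql_fetch_object($chkacc))
                $noacc = $row->NoAcc_col;
            mysql_free_result($chkacc);

        if($noacc != '')    $newchar = $noacc;
        else                $newchar = $char;

        $str2 = str_replace($char, $newchar, $str2);
        }
    return $str2;
    }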

For full disclosure, I'll also include the makesortname() function, though I doubt it has anything to do with the problem...

function makesortname($nameN)
    {
    $nameN = dashnames($nameN);
    $wordlist = explode(' ', $nameN, 2);
    $wordc = count($wordlist);

    if($wordc == 1)             $nameS = $wordlist[0];
    if($wordc == 2)             $nameS = $wordlist[1] . $wordlist[0];

    // Strip spaces and punctuation in a single call.
    $nameS = str_replace(array(' ', ',', ':', ';', '.', '-', "'", '"', '(', ')', ']', '[', '/'), '', $nameS);
    $nameS = str_replace("&", 'and', $nameS);
    $nameS = strtolower(accent($nameS));

    return $nameS;
    }

Solution

  • So I managed to fix my own problem!

    I wrote a new function that checks the encoding of the string, which lets me use either strlen()/substr() or mb_strlen()/mb_substr() depending on the encoding.

    There was also an encoding issue in my MySQL table.

    Now that all this has been fixed, the function works as expected.

    Thanks for your help and contributions, everyone!
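
The approach described in the answer could look roughly like the following sketch. The author's actual function is not shown, so `split_chars()` is a hypothetical name, and treating anything that is not valid UTF-8 as a single-byte encoding is an assumption.

```php
<?php
// Split a string into characters, choosing byte or multibyte functions
// depending on whether the input is valid UTF-8.
function split_chars(string $s): array
{
    $chars = [];
    if (mb_check_encoding($s, 'UTF-8')) {
        $len = mb_strlen($s, 'UTF-8');
        for ($i = 0; $i < $len; $i++) {
            $chars[] = mb_substr($s, $i, 1, 'UTF-8');
        }
    } else {
        // Assumption: not UTF-8 means a single-byte encoding such as ISO-8859-1.
        $len = strlen($s);
        for ($i = 0; $i < $len; $i++) {
            $chars[] = substr($s, $i, 1);
        }
    }
    return $chars;
}

echo count(split_chars("Erg\xC3\xBCn")), "\n";   // 5 (UTF-8 input)
echo count(split_chars("Erg\xFCn")), "\n";       // 5 (ISO-8859-1 input)
```

Pure ASCII strings are also valid UTF-8, so they take the first branch. An alternative design would be to convert everything to UTF-8 once on input, so the rest of the code only ever needs the mb_* functions.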