Tags: php, encoding, utf-8, strlen, multibyte-functions

PHP and UTF-8 String functions WITHOUT MB-Functions?


I am trying to use UTF-8 with PHP. The output on my site looks fine (äöüß etc. display correctly when testing), but there is one simple problem: when I use echo strlen("Ä"); it prints "2". I read this topic: "strlen() and UTF-8 encoding". In the answer I read this:

The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.

I wonder why my data would not be valid UTF-8, because:

  • I saved all my files as "UTF-8 without BOM"
  • I send a UTF-8 header on the first line
  • My browser also says "Encoding: UTF-8"

This is my code:

<?php
header("Content-Type: text/html; charset=utf-8");

$test = 'Ä';
echo strlen($test);
var_dump($test);

?>

My question: can I use the normal PHP functions with UTF-8, or must I use the "mb" functions?

If it's possible to use the normal PHP functions, why does strlen() return 2 in my code instead of 1?


Solution

  • strlen() returns the length of the string in bytes, not characters, and "Ä" is encoded as two bytes in UTF-8 (0xC3 0x84), hence the 2. Your data is valid UTF-8. You can change this behaviour with the mbstring.func_overload ini setting, which makes strlen() count characters instead, but the setting is global and also affects a number of other functions, such as strpos() and substr() (the full list is in the linked documentation).

    This can have serious adverse effects elsewhere in your code, particularly if you use third-party libraries that aren't aware of it, so it isn't recommended. The setting was also deprecated in PHP 7.2 and removed in PHP 8.0.

    It's better to call the mb_* functions directly when you know you're working with UTF-8 strings. Setting mbstring.func_overload simply tells PHP to use the mb_* functions in place of the normal string functions "under the hood".
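
To make the byte/character difference concrete, here is a minimal sketch comparing the plain functions with their mb_* counterparts. It assumes the mbstring extension is loaded. The variable names are only illustrative, and the string is written with byte escapes ("\xC3\x84" is "Ä" in UTF-8) so the result does not depend on how the file itself is saved.

```php
<?php
// "Ä" (U+00C4) encoded as UTF-8 is the two bytes 0xC3 0x84.
$test = "\xC3\x84";

$bytes = strlen($test);              // counts bytes
$chars = mb_strlen($test, 'UTF-8');  // counts characters (code points)

echo $bytes, "\n";          // 2
echo $chars, "\n";          // 1
echo bin2hex($test), "\n";  // c384

// The same difference applies to substr(): it cuts at byte offsets.
$word = "\xC3\x84pfel";                           // "Äpfel"
$firstChar = mb_substr($word, 0, 1, 'UTF-8');     // whole character "Ä"
$firstByte = substr($word, 0, 1);                 // only the lead byte 0xC3

echo bin2hex($firstChar), "\n";  // c384
echo bin2hex($firstByte), "\n";  // c3

// The truncated byte is no longer valid UTF-8, which is the kind of
// data that gets rendered as a replacement character.
var_dump(mb_check_encoding($firstChar, 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding($firstByte, 'UTF-8'));  // bool(false)
```

Passing the encoding explicitly to each mb_* call is a choice made here to avoid relying on the configured internal encoding; setting it once with mb_internal_encoding('UTF-8') would be an alternative.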