PHP preg_ /u utf-8 switch - Not understanding what it does in practice


I am converting a PHP/MariaDB web application from latin1 to UTF-8. I have it working, but I am not using the /u switch on any of my preg_ statements and it seems to be working fine. I have tried samples of Russian, Traditional and Simplified Chinese, Japanese, Arabic and Hindi. Part of the application is a wiki which uses preg_ statements extensively, and it works fine too.

So what is the preg /u switch supposed to do, since everything seems to work fine without it?

I have been looking up information on this for 2 weeks and I can't find anything that explains the /u switch in a way that differentiates its use from 'not' using it.

I have determined that the PCRE my PHP is using does have the UTF-8 features. I'm using PHP v5.6.20 and MariaDB 5.5.32. My web pages, MySQL driver and MariaDB are all using UTF-8.


Solution

  • The u modifier tells PCRE to treat the pattern and the subject as UTF-8 rather than as raw bytes, which changes how certain constructs match. For example, the dot metacharacter can then consume multiple bytes, provided they form a valid UTF-8 sequence:

    preg_match('/^.$/', '老');  // 0
    preg_match('/^.$/u', '老'); // 1
    

    Another example is what a character class covers:

    preg_match('/^[[:print:]]$/', '老'); // 0
    preg_match('/^[[:print:]]$/u', '老'); // 1
    

    When UTF-8 text (or indeed text in any other encoding) appears as a plain literal in the pattern, the u modifier effectively makes no difference, because PCRE ultimately matches it byte by byte.
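To see the difference in practice, here is a minimal sketch that counts matches with and without u. It assumes the script file itself is saved as UTF-8 (so '老' is three bytes); the variable names are only illustrative. It also covers two cases the examples above do not: a multibyte character inside a character class, where the "no difference for literals" rule stops applying, and an invalid UTF-8 subject, which makes a /u call fail.

```php
<?php
// Assumes this file is saved as UTF-8, so '老' is the three bytes E8 80 81.
$s = '老';

// Without /u, "." consumes one byte, so one character yields three matches.
$bytes = preg_match_all('/./', $s, $m1);
// With /u, "." consumes one whole code point.
$chars = preg_match_all('/./u', $s, $m2);

// A plain literal in the pattern matches the same bytes either way.
$lit  = preg_match('/老/', 'abc老');
$litU = preg_match('/老/u', 'abc老');

// Inside a character class, without /u the class is a set of single bytes
// and can only match one of them, so a whole character never fits.
$cls  = preg_match('/^[老人]$/', $s);
$clsU = preg_match('/^[老人]$/u', $s);

// With /u the subject must be valid UTF-8; otherwise the call returns false.
$bad = preg_match('/./u', "\xFF");
$err = (preg_last_error() === PREG_BAD_UTF8_ERROR);

var_dump($bytes, $chars, $lit, $litU, $cls, $clsU, $bad, $err);
```

So an application can appear to work without u as long as its patterns only contain literals and ASCII metacharacters; the differences show up once a pattern uses ., character classes or quantifiers against multibyte text.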