Tags: php, regex, utf-8, pcre, character-class

Non-ASCII characters in UTF-8 mode regular expression


Question

The PHP manual states:

"In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes."

Why do Persian digits match \d or [[:digit:]] in "UTF-8 mode"?

Elaboration

An answerer's remark on an unrelated question mentions that, in regular expressions, \d matches not only the ASCII digits 0 through 9 but also, for example, Persian digits (۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷).

The above-mentioned question carries a different tag, but the behavior can be observed in PHP as well. With this in mind, I wrote the following test:

$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/', $string, $capture);

The resulting array $capture contains a match on 5 only.

Using the u modifier to turn on "UTF-8 mode" and running this:

$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/u', $string, $capture);

results in $capture containing matches on both ۳ and 5.
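
For a side-by-side comparison, the two runs can be combined into one self-contained script. This is only a minimal sketch: the helper name digit_runs is made up for illustration, the file is assumed to be saved as UTF-8, and the setlocale() call mirrors the C locale mentioned in the notes below.

```php
<?php
// Pin the locale to C so that locale tables cannot influence \d.
setlocale(LC_ALL, 'C');

// Hypothetical helper: returns every substring matched by $pattern.
function digit_runs($pattern, $subject)
{
    preg_match_all($pattern, $subject, $capture);
    return $capture[0];
}

$string = 'I have ۳ apples and 5 oranges';

// Without the u modifier the subject is treated as bytes.
var_dump(digit_runs('/\d+/', $string));

// With the u modifier the subject is treated as UTF-8 text.
var_dump(digit_runs('/\d+/u', $string));
```

The first dump should list 5 only, the second both ۳ and 5, matching the results described above.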

Notes

  • this question refers to PHP 5.6.22 (newest to date)
  • both tests were executed while explicitly using the C locale.

Solution

  • Because the documentation is wrong, and unfortunately this is not the only place where it is.

    PHP uses PCRE under the hood to implement its preg_* functions. PCRE's documentation is thus authoritative there. PHP's documentation is based on PCRE's, but it looks like you found yet another mistake.

    Here's what you can read in PCRE's docs (emphasis mine):

    By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:

    [:alnum:]  becomes  \p{Xan}
    [:alpha:]  becomes  \p{L}
    [:blank:]  becomes  \h
    [:digit:]  becomes  \p{Nd}
    [:lower:]  becomes  \p{Ll}
    [:space:]  becomes  \p{Xps}
    [:upper:]  becomes  \p{Lu}
    [:word:]   becomes  \p{Xwd}
    

    If you dig further in PHP's docs, you'll find the following:

    u (PCRE_UTF8)

    This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

    This is, unfortunately, a lie. The u modifier in PHP means PCRE_UTF8 | PCRE_UCP (UCP stands for Unicode Character Properties). The PCRE_UCP flag is the one that changes the meaning of \d, \w and the like, as you can see from the docs above. Your tests confirm that.


    As a side note, don't infer properties of one regex flavor from another. It doesn't always work (heh, even this chart forgot about the PCRE_UCP option).
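
The substitution table above can be checked from PHP, and it also suggests a way to match ASCII digits only while keeping the u modifier: a literal range such as [0-9] is not rewritten to a Unicode property. The following is a minimal sketch, assuming a PCRE build with Unicode property support; the helper name matches_for is hypothetical.

```php
<?php
// Hypothetical helper: returns all matches of $pattern in $subject.
function matches_for($pattern, $subject)
{
    preg_match_all($pattern, $subject, $capture);
    return $capture[0];
}

$string = 'I have ۳ apples and 5 oranges';

// With u, \d and [[:digit:]] follow the Unicode property Nd (decimal number).
$shorthand = matches_for('/\d+/u', $string);
$posix     = matches_for('/[[:digit:]]+/u', $string);
$property  = matches_for('/\p{Nd}+/u', $string);

// A literal range is unaffected by PCRE_UCP and matches ASCII digits only.
$asciiOnly = matches_for('/[0-9]+/u', $string);

// Both comparisons are expected to be true on such builds.
var_dump($shorthand === $property, $posix === $property);

// Expected to contain only "5".
var_dump($asciiOnly);
```

Writing [0-9] explicitly keeps UTF-8 handling for the rest of the pattern while making the digit set independent of the PCRE_UCP behavior.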