php regex drupal-6 utf-8 arabic

How do I define a libpcre regexp for Arabic characters?


I need to define a PCRE regexp for certain spam-ish words in the Arabic/Persian alphabet, to be used in the Drupal spam module. The problem is that the usual PCRE regexp is apparently unable to find patterns in Arabic script.

For example, /bad word/ flags instances of 'bad word', but

/کلمه بد/i

fails to flag 'کلمه بد'.


Solution

  • This works for me when I add the u (UTF-8) PCRE modifier, which makes PCRE treat the pattern and the subject as UTF-8 rather than as raw bytes:

    $string = 'کلمه بد';
    
    if (preg_match('~\p{Arabic}~u', $string) > 0)
    {
        var_dump('contains Arabic characters');
    
        if (preg_match('~کلمه بد~ui', $string) > 0)
        {
            var_dump('contains spam-ish Arabic characters');
        }
    }
    
    string(26) "contains Arabic characters"
    string(35) "contains spam-ish Arabic characters"
    

    It runs just fine on IDEOne.com too. Be sure to save your source files as UTF-8 and to convert any input data to UTF-8 before matching.
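
    If you have several phrases to check, the same idea can be wrapped in a small helper. This is only a sketch: contains_spam_phrase() is a hypothetical name of mine, not part of the Drupal spam module, and it assumes both the phrases and the input are meant to be UTF-8. It uses preg_quote() so the phrases are matched literally, and the empty pattern '//u' as a quick validity check, since preg_match() does not return 1 when the subject is invalid UTF-8 under the u modifier.

```php
<?php
// Hypothetical helper (not part of Drupal): returns true when $text
// contains any of the given phrases. Both are assumed to be UTF-8.
function contains_spam_phrase($text, array $phrases)
{
    // Under the u modifier, preg_match() fails on invalid UTF-8 input,
    // so an empty pattern doubles as an encoding check.
    if (preg_match('//u', $text) !== 1) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    foreach ($phrases as $phrase) {
        // Escape regex metacharacters so the phrase is matched literally;
        // u = treat pattern and subject as UTF-8, i = case-insensitive.
        $pattern = '~' . preg_quote($phrase, '~') . '~ui';
        if (preg_match($pattern, $text) === 1) {
            return true;
        }
    }

    return false;
}
```

    Call it as contains_spam_phrase($comment_body, array('کلمه بد')), where $comment_body stands for whatever text you want to screen. Building one pattern per phrase keeps the sketch simple; for a long list you may prefer to join the quoted phrases with | into a single pattern.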