I need to define a PCRE regexp for certain spam-ish words in the Arabic/Persian alphabet, to be used in the Drupal spam module. The problem is that a plain PCRE regexp apparently cannot find patterns written in the Arabic alphabet.
For example, /bad word/ flags instances of 'bad word', but
/کلمه بد/i
fails to flag 'کلمه بد'.
It works for me when I add the u (Unicode) PCRE modifier:
$string = 'کلمه بد';

// The u modifier makes PCRE treat both pattern and subject as UTF-8.
if (preg_match('~\p{Arabic}~u', $string) > 0)
{
    var_dump('contains Arabic characters');

    if (preg_match('~کلمه بد~ui', $string) > 0)
    {
        var_dump('contains spam-ish Arabic characters');
    }
}
Output:
string(26) "contains Arabic characters"
string(35) "contains spam-ish Arabic characters"
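It runs just fine on IDEOne.com too. Be sure to save your source files as UTF-8 and to convert any input data to UTF-8 as well, because with the u modifier preg_match() returns false on malformed UTF-8 instead of matching.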
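If you need to check several words at once, the same idea can be wrapped in a small helper. This is only a minimal sketch: the function name contains_spam_word() and the sample word list are made up for illustration and are not part of the Drupal spam module. It escapes each word with preg_quote(), joins the words into one alternation with the u and i modifiers, and treats a false return from preg_match() (for example, on input that is not valid UTF-8) as an error rather than as "no match".

```php
<?php
// Hypothetical helper for illustration; not part of the Drupal spam module.
function contains_spam_word(string $text, array $words): bool
{
    if (count($words) === 0) {
        // An empty alternation would match every string, so bail out early.
        return false;
    }

    // Escape each word so regex metacharacters in it are matched literally.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '~');
    }, $words);

    // u: treat pattern and subject as UTF-8; i: case-insensitive.
    $pattern = '~(?:' . implode('|', $escaped) . ')~ui';

    $result = preg_match($pattern, $text);
    if ($result === false) {
        // With the u modifier this happens on malformed UTF-8, among other errors.
        throw new InvalidArgumentException(
            'preg_match() failed, error code ' . preg_last_error()
        );
    }

    return $result === 1;
}

$words = ['کلمه بد']; // assumed sample list

var_dump(contains_spam_word('این یک کلمه بد است', $words)); // bool(true)
var_dump(contains_spam_word('hello world', $words));        // bool(false)
```

Throwing on a false return is a design choice for this sketch; in a spam filter you may prefer to log the error and let the message through, or to convert the input to UTF-8 first and retry.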