Tags: unicode, filter, matching, multibyte, non-latin

preg_match a keyword variable against a list of Latin and non-Latin keywords in a local UTF-8 encoded file


I have a bad words filter that uses a list of keywords saved in a local UTF-8 encoded file. This file includes both Latin and non-Latin chars (mostly English and Arabic). Everything works as expected with Latin keywords, but when the variable includes non-Latin chars, the matching does not seem to recognize these existing keywords.

How do I go about matching both Latin and non-Latin keywords?

The badwords.txt file includes one word per line, as in this example:

bad
nasty
racist
سفالة
وساخة
جنس

Code used for matching:

$badwords = file_get_contents("badwords.txt");
$badtemp = explode("\n", $badwords);
$badwords = array_unique($badtemp);
$hasBadword = 0;
$query = strtolower($query);

foreach ($badwords as $key => $val) {
    if (!empty($val)) {
        $val = trim($val);
        $regexp = "/\b" . $val . "\b/i";
        if (preg_match($regexp, $query))
            $hasBadword = 1;

        if ($hasBadword == 1) {
            // Bad word detected, die...
        }
    }
}

I've read that iconv, the multibyte string functions (mbstring), and the /u pattern modifier might help with this, and I've tried a few things but can't seem to get it right. Any help in resolving this so that it matches both Latin and non-Latin keywords would be much appreciated.


Solution

  • The problem seems to relate to recognizing word boundaries; the \b construct is apparently not “Unicode aware.” This is what the answers to the question php regex word boundary matching in utf-8 seem to suggest. I was able to reproduce the problem even with text containing Latin letters like “é” when \b was used. The problem seems to disappear (i.e., Arabic words get correctly recognized) when I set

    $wstart = '(^|[^\p{L}])';
    $wend = '([^\p{L}]|$)';
    

    and modify the regexp as follows:

    $regexp = "/" . $wstart . $val . $wend . "/iu";
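    For reference, here is a minimal sketch of the whole loop with that fix applied. It is only an illustration under a few assumptions of my own: the helper name containsBadword is made up, I added preg_quote() so regex metacharacters in the word list cannot break the pattern, and the function returns on the first match instead of setting a flag.

    <?php
    // Minimal sketch: badwords.txt is assumed to be UTF-8, one word per line.
    // The helper name containsBadword is illustrative, not from the original post.
    function containsBadword(string $query, string $listFile = "badwords.txt"): bool
    {
        // file() splits on newlines; trim() strips \r and spaces; array_filter() drops empty lines.
        $badwords = array_unique(array_filter(array_map('trim', file($listFile))));

        // Unicode-aware "word boundary": start/end of string or any non-letter character.
        $wstart = '(^|[^\p{L}])';
        $wend   = '([^\p{L}]|$)';

        foreach ($badwords as $val) {
            // preg_quote() escapes regex metacharacters that may appear in the word list.
            $regexp = "/" . $wstart . preg_quote($val, '/') . $wend . "/iu";
            if (preg_match($regexp, $query)) {
                return true; // bad word detected
            }
        }
        return false;
    }

    // Example usage:
    // var_dump(containsBadword("this is a nasty comment")); // true
    // var_dump(containsBadword("نص يحتوي على جنس"));        // true
    // var_dump(containsBadword("perfectly clean text"));    // false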