Adapting PHP preg_replace_callback with Unicode for Cyrillic


I have the following PHP code:

$oldstring = 'oldword1, oldword2. Oldword1. Oldword2. OLDWORD1. OLDWORD2';
$results = array(array('old'=>'oldword1', 'new'=>'newword1'), array('old'=>'oldword2', 'new'=>'newword2'));
foreach ($results as $row) {
    $fndrep[$row['old']] = $row['new'];
}

$pattern = '~(?=([A-Z]?)([a-z]?))\b(?i)(?:'
         . implode('|', array_keys($fndrep))
         . ')\b~';

$newstring = preg_replace_callback($pattern, function ($m) use ($fndrep) {
    $lowm = $fndrep[strtolower($m[0])];
    if ($m[1])
        return ($m[2]) ? ucfirst($lowm) : strtoupper($lowm);
    else
        return $lowm;
}, $oldstring);

echo $newstring;

As you can see, it replaces every old word with its new counterpart while keeping the case style of the matched word. The $results array must contain the replacement words in lowercase only. This works perfectly for Latin characters, whether the old word appears in lowercase (oldword1, oldword2), capitalized (Oldword1, Oldword2), or in uppercase (OLDWORD1, OLDWORD2). But I need the same solution for Cyrillic.

If I change

$pattern = '~(?=([A-Z]?)([a-z]?))\b(?i)(?:'

to Unicode

$pattern = '~(?=([\x{0410}-\x{042F}]?)([\x{0430}-\x{044F}]?))\b(?i)(?:'

and

. ')\b~';

to

. ')\b~u';

it works for Cyrillic too, but only when the old word is in lowercase (oldword1, oldword2). It does not work when the old word is capitalized (Oldword1, Oldword2) or in uppercase (OLDWORD1, OLDWORD2).

Can anyone resolve the problem?


Solution

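  • I've found the solution. It turns out that strtolower(), strtoupper(), and ucfirst() are byte-oriented and do not handle multibyte UTF-8 characters such as Cyrillic letters. For Cyrillic we need mb_strtolower()/mb_strtoupper(), plus a little extra code in place of ucfirst(). I'm surprised that no one noticed it.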

    ...
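    // Set the encoding once, outside the callback.
    mb_internal_encoding('UTF-8');
    $newstring = preg_replace_callback($pattern, function ($m) use ($fndrep) {
        // Look up the replacement by the lowercased match (multibyte-safe).
        $lowm = $fndrep[mb_strtolower($m[0])];
        if ($m[1])
            // $m[2] set: capitalized word, so uppercase only the first character.
            // $m[2] empty: the whole word is uppercase.
            return ($m[2])
                ? mb_strtoupper(mb_substr($lowm, 0, 1)) . mb_substr(mb_strtolower($lowm), 1)
                : mb_strtoupper($lowm);
        else
            return $lowm;
    }, $oldstring);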
    ...
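
For reference, here is a minimal self-contained sketch that puts the pattern and the multibyte-safe callback together. It is an illustration under a few assumptions, not the exact code from the post: the function name replace_preserving_case and the sample Cyrillic words are hypothetical, the file is assumed to be saved as UTF-8, and \p{Lu} / \p{Ll} are used in place of the explicit \x{0410}-\x{042F} and \x{0430}-\x{044F} ranges so that letters outside those ranges (such as Ё/ё) are also covered.

```php
<?php
// Hypothetical helper (not from the original post): replaces whole words
// while mirroring the case style (lower, Capitalized, UPPER) of each match.
function replace_preserving_case(string $text, array $map): string
{
    // Quote each key so regex metacharacters in the words are taken literally.
    $alternatives = array_map(function ($word) {
        return preg_quote($word, '~');
    }, array_keys($map));

    // The lookahead captures an optional uppercase letter, then an optional
    // lowercase letter, before (?i) switches on case-insensitive matching.
    // Assumption: \p{Lu} and \p{Ll} stand in for the explicit code point ranges.
    $pattern = '~(?=(\p{Lu}?)(\p{Ll}?))\b(?i)(?:'
             . implode('|', $alternatives)
             . ')\b~u';

    return preg_replace_callback($pattern, function ($m) use ($map) {
        // Keys of $map are lowercase, so lowercase the match for the lookup.
        $new = $map[mb_strtolower($m[0], 'UTF-8')];
        if ($m[1] === '') {
            return $new; // match starts with a lowercase letter
        }
        if ($m[2] === '') {
            return mb_strtoupper($new, 'UTF-8'); // no lowercase second letter: all caps
        }
        // Capitalized match: uppercase only the first character.
        return mb_strtoupper(mb_substr($new, 0, 1, 'UTF-8'), 'UTF-8')
             . mb_substr($new, 1, null, 'UTF-8');
    }, $text);
}

$map = array('привет' => 'здравствуй', 'мир' => 'свет');
echo replace_preserving_case('привет, мир. Привет. МИР', $map), "\n";
```

Passing the encoding explicitly to each mb_* call avoids relying on mb_internal_encoding(), which is a global setting. The preg_quote() step is an extra safeguard in case the words to replace ever contain characters that are special in a regex.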