Search code examples
phparraysreplacepreg-replacelarge-data

How to work with a function preg_replace with a large number (1000000) of values in $patterns and $replacements arrays?


Hello dear programmers! I'm having a problem with the speed of the function preg_replace().

When I had little values (words) in the $patterns and $replacements arrays, the problem was not with the speed of searching and replacing in text from arrays, and when the number of the value in the arrays increased by 1.000.000 then the function preg_replace() was repeatedly slowed down. How can I search and replace in the text if I have more than 1,000,000 values (words) in the arrays? How to make a replacement as quickly as possible? Can the solution of the problem be buffered or cached? What advise, how correctly to me to act?

Here is an example of my array:

$patterns = 
array
(
0 => "/\bмувосокори\b/u",
1 => "/\bмунаггас\b/u",
2 => "/\bмангит\b/u",
3 => "/\bмангития\b/u",
4 => "/\bмунфачир\b/u",
5 => "/\bмунфачира\b/u",
6 => "/\bманфиатпарасти\b/u",
7 => "/\bманфиатчу\b/u",
8 => "/\bманфиатчуи\b/u",
9 => "/\bманфиатхох\b/u",
10 => "/\bманфи\b/u",
...........................
1000000 => "/\bмусби\b/u"
)

$replacements =
array
(  
0 => "мувосокорӣ",
1 => "мунағғас",
2 => "манғит",
3 => "манғития",
4 => "мунфаҷир",
5 => "мунфаҷира",
6 => "манфиатпарастӣ",
7 => "манфиатҷӯ",
8 => "манфиатҷӯӣ",
9 => "манфиатхоҳ",
10 => "манфӣ",
.....................
1000000 => "мусбӣ"
);

$text = "мувосокори мунаггас мангит мангития мунфачир манфиатпарасти...";
$result = preg_replace($patterns, $replacements, $text);

I use this javascript function in index.html file:

<script>
function response(str) {
    if (str.length == 0) { 
        document.getElementById("text").innerHTML = "";
        return;
    } else {
        var xmlhttp = new XMLHttpRequest();
        xmlhttp.onreadystatechange = function() {
            if (this.readyState == 4 && this.status == 200) {
                document.getElementById("text").innerHTML = this.responseText;
            }
        };
        xmlhttp.open("GET", "response.php?request=" + str, true);
        xmlhttp.send();
    }
}
</script>

PHP file response.php source:

<?php

$patterns = array();
$replacements = array();

$request = $_REQUEST["request"];

$response = "";

if ($request !== "") {

$start = microtime(true);

$response = preg_replace($patterns, $replacements, $request);

$stop = microtime(true);

$time_replace = $stop - $start;

}

echo $response === "" ? "" : $response."<br>Time: $time_replace";

?>

Solution

  • The time complexity of your algorithm is roughly O(nm) where n is the number of words in your replacement array, and m the number of words in the request.

    Since all the patterns seem to look for words (\b before and after), and don't use any other regular expression syntax (only literal characters) you will get better performance by splitting the request into words and looking them up in an associative array, without the need to use regular expressions at all.

    So define your pattern/replacement data as an associative array, like this:

    $dict = array(
        "мувосокори" => "мувосокорӣ",
        "мунаггас" => "мунағғас",
        "мангит" => "манғит",
        "мангития" => "манғития",
        "мунфачир" => "мунфаҷир",
        "мунфачира" => "мунфаҷира",
        "манфиатпарасти" => "манфиатпарастӣ",
        "манфиатчу" => "манфиатҷӯ",
        "манфиатчуи" => "манфиатҷӯӣ",
        "манфиатхох" => "манфиатхоҳ",
        "манфи" => "манфӣ",
        ...........................
        "мусби" => "мусбӣ"
    );
    

    Then use preg_replace_callback to find each word in the request and look it up in the above dictionary:

    $response = preg_replace_callback("/\pL+/u", function ($m) use ($dict) {
        return isset($dict[$m[0]]) ? $dict[$m[0]] : $m[0];
    }, $request);
    

    The time complexity will be linear to the number of words in the request.

    Handling upper/lower case

    If you need to match also any variation in the capitalisation of the words, then it would be too much to store any such variation in the dictionary. Instead, you would keep the dictionary in all lowercase letters, and then use the code below. When there is a match with the dictionary, it checks what the capitalisation is of the original word, and applies the same to the replacement word:

    $response = preg_replace_callback("/\pL+/u", function ($m) use ($dict) {
        $word = mb_strtolower($m[0]);
        if (isset($dict[$word])) {
            $repl = $dict[$word];
            // Check for some common ways of upper/lower case
            // 1. all lower case
            if ($word === $m[0]) return $repl;
            // 2. all upper case
            if (mb_strtoupper($word) === $m[0]) return mb_strtoupper($repl);
            // 3. Only first letters are upper case
            if (mb_convert_case($word,  MB_CASE_TITLE) === $m[0]) return mb_convert_case($repl,  MB_CASE_TITLE);
            // Otherwise: check each character whether it should be upper or lower case
            for ($i = 0, $len = mb_strlen($word); $i < $len; ++$i) {
                $mixed[] = mb_substr($word, $i, 1) === mb_substr($m[0], $i, 1) 
                    ? mb_substr($repl, $i, 1)
                    : mb_strtoupper(mb_substr($repl, $i, 1));
            }
            return implode("", $mixed);
        }
        return $m[0]; // Nothing changes
    }, $request);
    

    Converting your existing arrays

    You can use a little piece of code to transform your current $patterns and $replacements arrays to the new data structure, to avoid you have to do it "manually":

    foreach ($patterns as $i => $pattern) {
        $dict[explode("\b", $pattern)[1]] = $replacements[$i];
    }
    

    Of course, you should not include this conversion in your code, but just run it once to produce the new array structure and then put that array literal in your code.