php recursion compression ascii backtracking

Compressing string using ASCII char codes using backtracking in php

I want to can compress a string using the chars ASCII codes. I want to compress them using number patterns. Because ASCII codes are numbers, I want to find sub-patterns in the list of ASCII char codes.

Theory

This will be the format for every pattern I found:

[nnn][n][nn], where:
- [nnn] is the ASCII code for first char, from group numbers with same pattern.
- [n] is a custom number for a certain pattern/rule (I will explain more below).
- [nn] shows how many times this rule happens.

The number patterns are not concretely established. But let me give you some examples:

same char
linear growth (every number/ascii is greater, with one, than previous)
linear decrease (every number/ascii is smaller, with one, than previous)

Now let's see some situations:

"adeflk" becomes "097.1.01-100.2.03-108.3.02"
- same char ones, linear growth three times, linear decrease twice.
"rrrrrrrrrrr" becomes "114.1.11"
- same char eleven times.
"tsrqpozh" becomes "116.3.06-122.1.01-104.1.01"
- linear decrease six times, same char ones, same char ones.

I added dots ('.') and dashes ('-') so you can see them easily.

Indeed, we don't see good results (compression). I want to use this algorithm for large strings. And adding more rules (number patterns) we increase changes for making shorter result than original. I know the existent compressing solutions. I want this solution because the result have only digits, and it helps me.

What I've tried

// recursive function
function run (string $data, array &$rules): string {
    if (strlen($data) == 1) {
        // str_pad for having always ASCII code with 3 digits
        return (str_pad(ord($data), 3, '0', STR_PAD_LEFT) .'.'. '1' .'.'. '01');
    }

    $ord = ord($data); // first char
    $strlen = strlen($data);
    $nr = str_pad($ord, 3, '0', STR_PAD_LEFT); // str_pad for having always ASCII code with 3 digits
    $result = '';

    // compares every rule
    foreach ($rules as $key => $rule) {
        for ($i = 1; $i < $strlen; $i++) {
            // check for how many times this rule matches
            if (!$rule($ord, $data, $i)) {
                // save the shortest result (so we can compress)
                if (strlen($r = ($nr .'.'. $key .'.'. $i .' - '. run(substr($data, $i), $rules))) < strlen($result)
                || !$result) {
                    $result = $r;
                }
                continue 2; // we are going to next rule
            }
        }

        // if comes here, it means entire $data follow this rule ($key)
        if (strlen($r = (($nr .'.'. $key .'.'. $i))) < strlen($result)
        || !$result) {
            $result = $r; // entire data follow this $rule
        }
    }

    return $result; // it will return the shortest result it got
}

// ASCII compressor
function compress (string $data): string {
    $rules = array( // ASCII rules
        1 => function (int $ord, string $data, int $i): bool { // same char
            return ($ord == ord($data[$i]));
        },
        2 => function (int $ord, string $data, int $i): bool { // linear growth
            return (($ord+$i) == ord($data[$i]));
        },
        3 => function (int $ord, string $data, int $i): bool { // progressive growth
            return ((ord($data[$i-1])+$i) == ord($data[$i]));
        },
        4 => function (int $ord, string $data, int $i): bool { // linear decrease
            return (($ord-$i) == ord($data[$i]));
        },
        5 => function (int $ord, string $data, int $i): bool { // progressive decrease
            return ((ord($data[$i-1])-$i) == ord($data[$i]));
        }
    );

    // we use base64_encode because we want only ASCII chars
    return run(base64_encode($data), $rules);
}

I added dots ('.') and dashes ('-') only for testing easily.

Results

compress("ana ar") => "089.1.1 - 087.1.1 - 053.1.1 - 104.1.1 - 073.1.1 - 071.4.2 - 121.1.01"

Which is ok. And it runs fast. Without a problem.

compress("ana aros") => Fatal error: Maximum execution time of 15 seconds exceeded

If string is a bit longer, it gets toooo much. It works fast and normal for 1-7 chars. But when there are more chars in string, that happens.

The algorithm doesn't run perfect and doesn't return the perfect 6-digit pattern, indeed. Before getting there, I'm stucked with that.

Question

How I can increase performance of this backtracking for running ok now and also with more rules?

Solution

I came up with a initial solution to your problem. Please note:

You will never get a sequence of just one letter, because each 2 consecutive letters are a "linear growth" with a certain difference.
My solution is not very clean. You can, for example combine $matches and $rules to a single array.
My solution is naive and greedy. For example, in the example adeflk, the sequence def is a sequence of 3, but because my solution is greedy, it will consider ad as a sequence of 2, and ef as another sequence of 2. That being said, you can still improve my code.
The code is hard to test. You should probably make use of OOP and divide the code to many small methods that are easy to test separately.

<?php

function compress($string, $rules, $matches) {
    if ($string === '') {
        return getBestMatch($matches);
    }
    $currentCharacter = $string[0];

    $matchFound = false;

    foreach ($rules as $index => &$rule) {
        if ($rule['active']) {
            $soFarLength = strlen($matches[$index]);
            if ($soFarLength === 0) {
                $matchFound = true;
                $matches[$index] = $currentCharacter;
            } elseif ($rule['callback']($currentCharacter, $matches[$index])) {
                $matches[$index] .= $currentCharacter;
                $matchFound = true;
            } else {
                $rule['active'] = false;
            }
        }
    }

    if ($matchFound) {
        return compress(substr($string, 1), $rules, $matches);
    } else {
        return getBestMatch($matches) . startNewSequence($string);
    }
}

function getBestMatch($matches) {

    $rule = -1;
    $length = -1;
    foreach ($matches as $index => $match) {
        if (strlen($match) > $length) {
            $length = strlen($match);
            $rule = $index;
        }
    }
    if ($length <= 0) {
        return '';
    }
    return ord($matches[$rule][0]) . '.' . $rule . '.' . $length . "\n";
}

function startNewSequence($string) {
    $rules = [
        // rule number 1 - all characters are the same
        1 => [
            'active' => true,
            'callback' => function ($a, $b) {
                return $a === substr($b, -1);
            }
        ],
        // rule number 2 - ASCII code of current letter is one more than the last letter ("linear growth")
        2 => [
            'active' => true,
            'callback' => function ($a, $b) {
                return ord($a) === (1 + ord(substr($b, -1)));
            }
        ],
        // rule number 3 - ASCII code is a geometric progression. The ord() of each character increases with each step.
        3 => [
            'active' => true,
            'callback' => function ($a, $b) {
                if (strlen($b) == 1) {
                    return ord($a) > ord($b);
                }
                $lastCharOrd = ord(substr($b, -1));
                $oneBeforeLastCharOrd = ord(substr($b, -2, 1));
                $lastDiff = $lastCharOrd - $oneBeforeLastCharOrd;
                $currentOrd = ord($a);

                return ($currentOrd - $lastCharOrd) === ($lastDiff + 1);
            }
        ],
        // rule number 4 - ASCII code of current letter is one less than the last letter ("linear decrease")
        4 => [
            'active' => true,
            'callback' => function ($a, $b) {
                return ord($a) === (ord(substr($b, -1)) - 1);
            }
        ],
        // rule number 5 - ASCII code is a negative geometric progression. The ord() of each character decreases by one
        // with each step.
        5 => [
            'active' => true,
            'callback' => function ($a, $b) {
                if (strlen($b) == 1) {
                    return ord($a) < ord($b);
                }
                $lastCharOrd = ord(substr($b, -1));
                $oneBeforeLastCharOrd = ord(substr($b, -2, 1));
                $lastDiff = $lastCharOrd - $oneBeforeLastCharOrd;
                $currentOrd = ord($a);

                return ($currentOrd - $lastCharOrd) === ($lastDiff - 1);
            }
        ],
    ];

    $matches = [
        1 => '',
        2 => '',
        3 => '',
        4 => '',
        5 => '',
    ];

    return compress($string, $rules, $matches);
}

echo startNewSequence('tsrqpozh');