Search code examples
phprecursioncompressionasciibacktracking

Compressing string using ASCII char codes using backtracking in php


I want to can compress a string using the chars ASCII codes. I want to compress them using number patterns. Because ASCII codes are numbers, I want to find sub-patterns in the list of ASCII char codes.

Theory

This will be the format for every pattern I found:

  • [nnn][n][nn], where:
    • [nnn] is the ASCII code for first char, from group numbers with same pattern.
    • [n] is a custom number for a certain pattern/rule (I will explain more below).
    • [nn] shows how many times this rule happens.

The number patterns are not concretely established. But let me give you some examples:

  1. same char
  2. linear growth (every number/ascii is greater, with one, than previous)
  3. linear decrease (every number/ascii is smaller, with one, than previous)

Now let's see some situations:

  • "adeflk" becomes "097.1.01-100.2.03-108.3.02"
    • same char ones, linear growth three times, linear decrease twice.
  • "rrrrrrrrrrr" becomes "114.1.11"
    • same char eleven times.
  • "tsrqpozh" becomes "116.3.06-122.1.01-104.1.01"
    • linear decrease six times, same char ones, same char ones.

I added dots ('.') and dashes ('-') so you can see them easily.

Indeed, we don't see good results (compression). I want to use this algorithm for large strings. And adding more rules (number patterns) we increase changes for making shorter result than original. I know the existent compressing solutions. I want this solution because the result have only digits, and it helps me.

What I've tried

// recursive function
function run (string $data, array &$rules): string {
    if (strlen($data) == 1) {
        // str_pad for having always ASCII code with 3 digits
        return (str_pad(ord($data), 3, '0', STR_PAD_LEFT) .'.'. '1' .'.'. '01');
    }

    $ord = ord($data); // first char
    $strlen = strlen($data);
    $nr = str_pad($ord, 3, '0', STR_PAD_LEFT); // str_pad for having always ASCII code with 3 digits
    $result = '';

    // compares every rule
    foreach ($rules as $key => $rule) {
        for ($i = 1; $i < $strlen; $i++) {
            // check for how many times this rule matches
            if (!$rule($ord, $data, $i)) {
                // save the shortest result (so we can compress)
                if (strlen($r = ($nr .'.'. $key .'.'. $i .' - '. run(substr($data, $i), $rules))) < strlen($result)
                || !$result) {
                    $result = $r;
                }
                continue 2; // we are going to next rule
            }
        }

        // if comes here, it means entire $data follow this rule ($key)
        if (strlen($r = (($nr .'.'. $key .'.'. $i))) < strlen($result)
        || !$result) {
            $result = $r; // entire data follow this $rule
        }
    }

    return $result; // it will return the shortest result it got
}

// ASCII compressor
function compress (string $data): string {
    $rules = array( // ASCII rules
        1 => function (int $ord, string $data, int $i): bool { // same char
            return ($ord == ord($data[$i]));
        },
        2 => function (int $ord, string $data, int $i): bool { // linear growth
            return (($ord+$i) == ord($data[$i]));
        },
        3 => function (int $ord, string $data, int $i): bool { // progressive growth
            return ((ord($data[$i-1])+$i) == ord($data[$i]));
        },
        4 => function (int $ord, string $data, int $i): bool { // linear decrease
            return (($ord-$i) == ord($data[$i]));
        },
        5 => function (int $ord, string $data, int $i): bool { // progressive decrease
            return ((ord($data[$i-1])-$i) == ord($data[$i]));
        }
    );

    // we use base64_encode because we want only ASCII chars
    return run(base64_encode($data), $rules);
}

I added dots ('.') and dashes ('-') only for testing easily.

Results

compress("ana ar") => "089.1.1 - 087.1.1 - 053.1.1 - 104.1.1 - 073.1.1 - 071.4.2 - 121.1.01"

Which is ok. And it runs fast. Without a problem.

compress("ana aros") => Fatal error: Maximum execution time of 15 seconds exceeded

If string is a bit longer, it gets toooo much. It works fast and normal for 1-7 chars. But when there are more chars in string, that happens.

The algorithm doesn't run perfect and doesn't return the perfect 6-digit pattern, indeed. Before getting there, I'm stucked with that.

Question

How I can increase performance of this backtracking for running ok now and also with more rules?


Solution

  • I came up with a initial solution to your problem. Please note:

    • You will never get a sequence of just one letter, because each 2 consecutive letters are a "linear growth" with a certain difference.
    • My solution is not very clean. You can, for example combine $matches and $rules to a single array.
    • My solution is naive and greedy. For example, in the example adeflk, the sequence def is a sequence of 3, but because my solution is greedy, it will consider ad as a sequence of 2, and ef as another sequence of 2. That being said, you can still improve my code.
    • The code is hard to test. You should probably make use of OOP and divide the code to many small methods that are easy to test separately.
    <?php
    
    function compress($string, $rules, $matches) {
        if ($string === '') {
            return getBestMatch($matches);
        }
        $currentCharacter = $string[0];
    
        $matchFound = false;
    
        foreach ($rules as $index => &$rule) {
            if ($rule['active']) {
                $soFarLength = strlen($matches[$index]);
                if ($soFarLength === 0) {
                    $matchFound = true;
                    $matches[$index] = $currentCharacter;
                } elseif ($rule['callback']($currentCharacter, $matches[$index])) {
                    $matches[$index] .= $currentCharacter;
                    $matchFound = true;
                } else {
                    $rule['active'] = false;
                }
            }
        }
    
        if ($matchFound) {
            return compress(substr($string, 1), $rules, $matches);
        } else {
            return getBestMatch($matches) . startNewSequence($string);
        }
    }
    
    function getBestMatch($matches) {
    
        $rule = -1;
        $length = -1;
        foreach ($matches as $index => $match) {
            if (strlen($match) > $length) {
                $length = strlen($match);
                $rule = $index;
            }
        }
        if ($length <= 0) {
            return '';
        }
        return ord($matches[$rule][0]) . '.' . $rule . '.' . $length . "\n";
    }
    
    function startNewSequence($string) {
        $rules = [
            // rule number 1 - all characters are the same
            1 => [
                'active' => true,
                'callback' => function ($a, $b) {
                    return $a === substr($b, -1);
                }
            ],
            // rule number 2 - ASCII code of current letter is one more than the last letter ("linear growth")
            2 => [
                'active' => true,
                'callback' => function ($a, $b) {
                    return ord($a) === (1 + ord(substr($b, -1)));
                }
            ],
            // rule number 3 - ASCII code is a geometric progression. The ord() of each character increases with each step.
            3 => [
                'active' => true,
                'callback' => function ($a, $b) {
                    if (strlen($b) == 1) {
                        return ord($a) > ord($b);
                    }
                    $lastCharOrd = ord(substr($b, -1));
                    $oneBeforeLastCharOrd = ord(substr($b, -2, 1));
                    $lastDiff = $lastCharOrd - $oneBeforeLastCharOrd;
                    $currentOrd = ord($a);
    
                    return ($currentOrd - $lastCharOrd) === ($lastDiff + 1);
                }
            ],
            // rule number 4 - ASCII code of current letter is one less than the last letter ("linear decrease")
            4 => [
                'active' => true,
                'callback' => function ($a, $b) {
                    return ord($a) === (ord(substr($b, -1)) - 1);
                }
            ],
            // rule number 5 - ASCII code is a negative geometric progression. The ord() of each character decreases by one
            // with each step.
            5 => [
                'active' => true,
                'callback' => function ($a, $b) {
                    if (strlen($b) == 1) {
                        return ord($a) < ord($b);
                    }
                    $lastCharOrd = ord(substr($b, -1));
                    $oneBeforeLastCharOrd = ord(substr($b, -2, 1));
                    $lastDiff = $lastCharOrd - $oneBeforeLastCharOrd;
                    $currentOrd = ord($a);
    
                    return ($currentOrd - $lastCharOrd) === ($lastDiff - 1);
                }
            ],
        ];
    
        $matches = [
            1 => '',
            2 => '',
            3 => '',
            4 => '',
            5 => '',
        ];
    
        return compress($string, $rules, $matches);
    }
    
    echo startNewSequence('tsrqpozh');