I want to can compress a string using the chars ASCII codes. I want to compress them using number patterns. Because ASCII codes are numbers, I want to find sub-patterns in the list of ASCII char codes.
This will be the format for every pattern I found:
The number patterns are not concretely established. But let me give you some examples:
Now let's see some situations:
I added dots ('.') and dashes ('-') so you can see them easily.
Indeed, we don't see good results (compression). I want to use this algorithm for large strings. And adding more rules (number patterns) we increase changes for making shorter result than original. I know the existent compressing solutions. I want this solution because the result have only digits, and it helps me.
// recursive function
function run (string $data, array &$rules): string {
if (strlen($data) == 1) {
// str_pad for having always ASCII code with 3 digits
return (str_pad(ord($data), 3, '0', STR_PAD_LEFT) .'.'. '1' .'.'. '01');
}
$ord = ord($data); // first char
$strlen = strlen($data);
$nr = str_pad($ord, 3, '0', STR_PAD_LEFT); // str_pad for having always ASCII code with 3 digits
$result = '';
// compares every rule
foreach ($rules as $key => $rule) {
for ($i = 1; $i < $strlen; $i++) {
// check for how many times this rule matches
if (!$rule($ord, $data, $i)) {
// save the shortest result (so we can compress)
if (strlen($r = ($nr .'.'. $key .'.'. $i .' - '. run(substr($data, $i), $rules))) < strlen($result)
|| !$result) {
$result = $r;
}
continue 2; // we are going to next rule
}
}
// if comes here, it means entire $data follow this rule ($key)
if (strlen($r = (($nr .'.'. $key .'.'. $i))) < strlen($result)
|| !$result) {
$result = $r; // entire data follow this $rule
}
}
return $result; // it will return the shortest result it got
}
// ASCII compressor
function compress (string $data): string {
$rules = array( // ASCII rules
1 => function (int $ord, string $data, int $i): bool { // same char
return ($ord == ord($data[$i]));
},
2 => function (int $ord, string $data, int $i): bool { // linear growth
return (($ord+$i) == ord($data[$i]));
},
3 => function (int $ord, string $data, int $i): bool { // progressive growth
return ((ord($data[$i-1])+$i) == ord($data[$i]));
},
4 => function (int $ord, string $data, int $i): bool { // linear decrease
return (($ord-$i) == ord($data[$i]));
},
5 => function (int $ord, string $data, int $i): bool { // progressive decrease
return ((ord($data[$i-1])-$i) == ord($data[$i]));
}
);
// we use base64_encode because we want only ASCII chars
return run(base64_encode($data), $rules);
}
I added dots ('.') and dashes ('-') only for testing easily.
compress("ana ar") => "089.1.1 - 087.1.1 - 053.1.1 - 104.1.1 - 073.1.1 - 071.4.2 - 121.1.01"
Which is ok. And it runs fast. Without a problem.
compress("ana aros") => Fatal error: Maximum execution time of 15 seconds exceeded
If string is a bit longer, it gets toooo much. It works fast and normal for 1-7 chars. But when there are more chars in string, that happens.
The algorithm doesn't run perfect and doesn't return the perfect 6-digit pattern, indeed. Before getting there, I'm stucked with that.
How I can increase performance of this backtracking for running ok now and also with more rules?
I came up with a initial solution to your problem. Please note:
$matches
and $rules
to a single array.adeflk
, the sequence def
is a sequence of 3, but because my solution is greedy, it will consider ad
as a sequence of 2, and ef
as another sequence of 2. That being said, you can still improve my code.<?php
function compress($string, $rules, $matches) {
if ($string === '') {
return getBestMatch($matches);
}
$currentCharacter = $string[0];
$matchFound = false;
foreach ($rules as $index => &$rule) {
if ($rule['active']) {
$soFarLength = strlen($matches[$index]);
if ($soFarLength === 0) {
$matchFound = true;
$matches[$index] = $currentCharacter;
} elseif ($rule['callback']($currentCharacter, $matches[$index])) {
$matches[$index] .= $currentCharacter;
$matchFound = true;
} else {
$rule['active'] = false;
}
}
}
if ($matchFound) {
return compress(substr($string, 1), $rules, $matches);
} else {
return getBestMatch($matches) . startNewSequence($string);
}
}
function getBestMatch($matches) {
$rule = -1;
$length = -1;
foreach ($matches as $index => $match) {
if (strlen($match) > $length) {
$length = strlen($match);
$rule = $index;
}
}
if ($length <= 0) {
return '';
}
return ord($matches[$rule][0]) . '.' . $rule . '.' . $length . "\n";
}
function startNewSequence($string) {
$rules = [
// rule number 1 - all characters are the same
1 => [
'active' => true,
'callback' => function ($a, $b) {
return $a === substr($b, -1);
}
],
// rule number 2 - ASCII code of current letter is one more than the last letter ("linear growth")
2 => [
'active' => true,
'callback' => function ($a, $b) {
return ord($a) === (1 + ord(substr($b, -1)));
}
],
// rule number 3 - ASCII code is a geometric progression. The ord() of each character increases with each step.
3 => [
'active' => true,
'callback' => function ($a, $b) {
if (strlen($b) == 1) {
return ord($a) > ord($b);
}
$lastCharOrd = ord(substr($b, -1));
$oneBeforeLastCharOrd = ord(substr($b, -2, 1));
$lastDiff = $lastCharOrd - $oneBeforeLastCharOrd;
$currentOrd = ord($a);
return ($currentOrd - $lastCharOrd) === ($lastDiff + 1);
}
],
// rule number 4 - ASCII code of current letter is one less than the last letter ("linear decrease")
4 => [
'active' => true,
'callback' => function ($a, $b) {
return ord($a) === (ord(substr($b, -1)) - 1);
}
],
// rule number 5 - ASCII code is a negative geometric progression. The ord() of each character decreases by one
// with each step.
5 => [
'active' => true,
'callback' => function ($a, $b) {
if (strlen($b) == 1) {
return ord($a) < ord($b);
}
$lastCharOrd = ord(substr($b, -1));
$oneBeforeLastCharOrd = ord(substr($b, -2, 1));
$lastDiff = $lastCharOrd - $oneBeforeLastCharOrd;
$currentOrd = ord($a);
return ($currentOrd - $lastCharOrd) === ($lastDiff - 1);
}
],
];
$matches = [
1 => '',
2 => '',
3 => '',
4 => '',
5 => '',
];
return compress($string, $rules, $matches);
}
echo startNewSequence('tsrqpozh');