Search code examples
regexperlencodingcharacter-encodingdecoding

Perl: How to encode and decode characters in uppercase alpha letters only


im dealing with that quite some time now. I have characters from lets say the Latin alphabet and want them to be encoded in uppercase alpha strings only. Is there any module that could do this? Or any BaseX encoding that i can modify to just use uc alpha characters?

i currently have implemented parts of it using regex substitutions, but it only covers a few characters and is definetly not efficient :)

anyway if there is no way to deal with that via a module or function, is there any way to do this efficient via a regex?

i thought about a tr/[\+,\-,...]/[PLUS,MINUS,...]/cds;

but it seems like tr only substitutes char by char and not char by sequence of chars :(

any ideas?

achim


Solution

  • To answer the tr question:

    %subs = ( '+' => 'PLUS' );
    my $pat = join '|', map quotemeta, keys %subs;
    s/($pat)/$subs{$1}/g;
    

    Base 26 is possible to do, but it's a bit hard and inefficient to implement since 26 is not a power of 2. But it's definitely what you want. I'll see about coding it up.

    In the meantime, here's a base 16 solution:

    sub bytes_to_base16 {
       my $e = unpack('H*', $_);
       $e =~ tr/0123456789ABCDEFabcdef/ABCDEFGHIJKLMNOPKLMNOP/;
       return $e;
    }
    
    sub base16_to_bytes {
       my $e = $_[0];
       $e =~ tr/ABCDEFGHIJKLMNOP/0123456789ABCDEF/;
       return pack('H*', $_);
    }
    

    Let's see how efficient base 26 is compared to base 16:

    $ perl -MMath::BigInt -MMath::BigFloat -E'
       my $n = Math::BigInt->new(1);
       my $bs = 0;
       for (1..10) {
          $n <<= 8;
          ++$bs;
          my $bd16 = 2*$bs;
          my $bd26 = Math::BigFloat->new($n)->blog(26, 5)->bceil->numify;
          say sprintf "%2d bytes takes %2d base16 digits or %2d base26 digits.".
                      " base26 is %3.0f%% of the size of base16.",
             $bs, $bd16, $bd26, $bd26/$bd16*100;
          }
    '
     1 bytes takes  2 base16 digits or  2 base26 digits. base26 is 100% of the size of base16.
     2 bytes takes  4 base16 digits or  4 base26 digits. base26 is 100% of the size of base16.
     3 bytes takes  6 base16 digits or  6 base26 digits. base26 is 100% of the size of base16.
     4 bytes takes  8 base16 digits or  7 base26 digits. base26 is  88% of the size of base16.
     5 bytes takes 10 base16 digits or  9 base26 digits. base26 is  90% of the size of base16.
     6 bytes takes 12 base16 digits or 11 base26 digits. base26 is  92% of the size of base16.
     7 bytes takes 14 base16 digits or 12 base26 digits. base26 is  86% of the size of base16.
     8 bytes takes 16 base16 digits or 14 base26 digits. base26 is  88% of the size of base16.
     9 bytes takes 18 base16 digits or 16 base26 digits. base26 is  89% of the size of base16.
    10 bytes takes 20 base16 digits or 18 base26 digits. base26 is  90% of the size of base16.
    

    An efficient implementation would produce slightly less efficient output.

    $ perl -MMath::BigInt -MMath::BigFloat -E'
       my $bs = 0;
       for (1..10) {
          ++$bs;
          my $bd16 = 2*$bs;
          my $bd26 = int($bs/4)*7 + ($bs%4)*2;
          say sprintf "%2d bytes takes %2d base16 digits or %2d base26 digits.".
                      " base26 is %3.0f%% of the size of base16.",
             $bs, $bd16, $bd26, $bd26/$bd16*100;
          }
    '
     1 bytes takes  2 base16 digits or  2 base26 digits. base26 is 100% of the size of base16.
     2 bytes takes  4 base16 digits or  4 base26 digits. base26 is 100% of the size of base16.
     3 bytes takes  6 base16 digits or  6 base26 digits. base26 is 100% of the size of base16.
     4 bytes takes  8 base16 digits or  7 base26 digits. base26 is  88% of the size of base16.
     5 bytes takes 10 base16 digits or  9 base26 digits. base26 is  90% of the size of base16.
     6 bytes takes 12 base16 digits or 11 base26 digits. base26 is  92% of the size of base16.
     7 bytes takes 14 base16 digits or 13 base26 digits. base26 is  93% of the size of base16.
     8 bytes takes 16 base16 digits or 14 base26 digits. base26 is  88% of the size of base16.
     9 bytes takes 18 base16 digits or 16 base26 digits. base26 is  89% of the size of base16.
    10 bytes takes 20 base16 digits or 18 base26 digits. base26 is  90% of the size of base16.
    

    Note that the efficient implementation uses an extra digits for inputs that are 7 bytes long.

    So is it worth the effort of using base26 over base16? Probably not, unless each byte is really precious.


    And finally, here's a base 26 implementation.

    my @syms = ('A'..'Z');
    my %syms = map { $syms[$_] => $_ } 0..$#syms;
    
    sub bytes_to_base26 {
       my $e = '';
    
       my $full_blocks = int(length($_[0]) / 4);
       for (0..$full_blocks-1) {
          my $block = unpack('N', substr($_[0], $_*4, 4));
          $e .= join '', @syms[
             $block / 26**6 % 26,
             $block / 26**5 % 26,
             $block / 26**4 % 26,
             $block / 26**3 % 26,
             $block / 26**2 % 26,
             $block / 26**1 % 26,
             $block / 26**0 % 26,
          ];
       }
    
       my $extra = substr($_[0], $full_blocks*4);
       for my $block (unpack('C*', $extra)) {
          $e .= join '', @syms[
             $block / 26**1 % 26,
             $block / 26**0 % 26,
          ];
       }
    
       return $e;
    }
    
    sub base26_to_bytes {
       my $d = '';
    
       my $full_blocks = int(length($_[0]) / 7);
       for (0..$full_blocks-1) {
          my $block = 0;
          $block = $block*26 + $syms{$_} for unpack '(a)*', substr($_[0], $_*7, 7);
          $d .= pack('N', $block);
       }
    
       my $extra = substr($_[0], $full_blocks*7);
       my @extra = unpack('(a)*', $extra);
       while (@extra) {
          my $block = 0;
          $block = $block*26 + $syms{ shift(@extra) };
          $block = $block*26 + $syms{ shift(@extra) };
          $d .= pack('C', $block);
       }
    
       return $d;
    }