
Perl: Packing a sequence of bytes into a string


I'm trying to run a simple test in which I build differently encoded binary strings and print them out. The underlying goal is to investigate a problem where sprintf cannot deal with a wide-character string passed in for the %s placeholder.

In this case, the binary string should contain just the Cyrillic "д" (because it lies outside ISO-8859-1).

The code below works when I use the character directly in the source.

But nothing that passes through pack works.

  • For the UTF-8 case, I need to set the UTF-8 flag on the string $ch, but how?
  • The UCS-2 case fails, and I suppose that's because Perl has no way to tell UCS-2 from ISO-8859-1, so that test is probably bollocks, right?

The code:

#!/usr/bin/perl

use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"

# https://perldoc.perl.org/open.html

use open qw(:std :encoding(UTF-8));

sub showme {
   my ($name,$ch) = @_;
   print "-------\n";
   print "This is test: $name\n";

   my $ord = ord($ch); # ordinal computed outside of "use bytes"; actually should yield the unicode codepoint

   {
      # https://perldoc.perl.org/bytes.html
      use bytes;
      my $mark = (utf8::is_utf8($ch) ? "yes" : "no");
      my $txt  = sprintf("Received string of length: %i byte, contents: %vd, ordinal x%04X, utf-8: %s\n", length($ch), $ch, $ord, $mark);
      print $txt,"\n";
   }

   print $ch, "\n";
   print "Combine: $ch\n";
   print "Concat: " . $ch . "\n";
   print "Sprintf: " . sprintf("%s",$ch) . "\n";
   print "-------\n";
}


showme("Cryillic direct" , "д");
showme("Cyrillic UTF-8"  , pack("HH","D0","B4"));  # UTF-8 of д is D0B4
showme("Cyrillic UCS-2"  , pack("HH","04","34"));  # UCS-2 of д is 0434

Current output:

Looks good

-------
This is test: Cryillic direct
Received string of length: 2 byte, contents: 208.180, ordinal x0434, utf-8: yes

д
Combine: д
Concat: д
Sprintf: д
-------

That's a no. Where does the 176 come from??

-------
This is test: Cyrillic UTF-8
Received string of length: 2 byte, contents: 208.176, ordinal x00D0, utf-8: no

а
Combine: а
Concat: а
Sprintf: а
-------

This is even worse.

-------
This is test: Cyrillic UCS-2
Received string of length: 2 byte, contents: 0.48, ordinal x0000, utf-8: no

0
Combine: 0
Concat: 0
Sprintf: 0
-------

Solution

  • Please see whether the following demonstration code is of any help:

    use strict;
    use warnings;
    use feature 'say';
    
    use utf8;     # https://perldoc.perl.org/utf8.html
    use Encode;   # https://perldoc.perl.org/Encode.html
    
    my $str;
    
    my $utf8   = 'Привет Москва';
    my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004';    # Little Endian
    my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430';    # Big Endian
    my $utf16  = '041f044004380432043504420020041c043e0441043a04320430';
    my $utf32  = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';
    
    # https://perldoc.perl.org/functions/binmode.html
    
    binmode STDOUT, ':utf8'; 
    
    # https://perldoc.perl.org/feature.html#The-'say'-feature
    
    say 'UTF-8:   ' . $utf8;  
    
    # https://perldoc.perl.org/Encode.html#THE-PERL-ENCODING-API
    
    $str = pack('H*',$ucs2be);
    say 'UCS-2BE: ' . decode('UCS-2BE',$str);  
    
    $str = pack('H*',$ucs2le);
    say 'UCS-2LE: ' . decode('UCS-2LE',$str);
    
    $str = pack('H*',$utf16);
    say 'UTF-16:  '. decode('UTF16',$str);
    
    $str = pack('H*',$utf32);
    say 'UTF-32:  ' . decode('UTF32',$str);
    

    Output

    UTF-8:   Привет Москва
    UCS-2BE: Привет Москва
    UCS-2LE: Привет Москва
    UTF-16:  Привет Москва
    UTF-32:  Привет Москва
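
Applied to the single character from the question, the same approach might look like the sketch below. It also shows where the 176 comes from: the pack template "H" without a count consumes only one hex digit per argument, so "B4" contributes just "B" and the low nybble is padded with zero (0xB0 = 176). The variable names are illustrative only; the sketch assumes nothing beyond pack, unpack and the Encode module already used above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "H" without a count takes ONE hex digit per argument and zero-pads the
# low nybble, so "D0" and "B4" become the bytes 0xD0 and 0xB0.
my $short     = pack('HH', 'D0', 'B4');
my $short_hex = unpack('H*', $short);      # "d0b0"

# "H2" takes both digits, so the intended bytes 0xD0 0xB4 come out.
my $bytes     = pack('H2H2', 'D0', 'B4');
my $bytes_hex = unpack('H*', $bytes);      # "d0b4"

# pack() yields a byte string. decode() converts it into a character
# string; there is no need to set the UTF-8 flag by hand.
my $from_utf8 = decode('UTF-8',   $bytes);
my $from_ucs2 = decode('UCS-2BE', pack('H*', '0434'));

binmode STDOUT, ':encoding(UTF-8)';
printf "HH     -> %s\n", $short_hex;
printf "H2H2   -> %s\n", $bytes_hex;
printf "UTF-8  -> %s (U+%04X, %d character)\n",
    $from_utf8, ord($from_utf8), length($from_utf8);
printf "UCS-2  -> %s (U+%04X)\n", $from_ucs2, ord($from_ucs2);
print  "Sprintf: " . sprintf("%s", $from_utf8) . "\n";
```

Both decoded strings hold the single character U+0434, so they behave like the "direct" test in the question when passed to print or sprintf.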
    

    Supported Cyrillic encodings

    use strict;
    use warnings;
    use feature 'say';
    
    use Encode;
    use utf8;
    
    binmode STDOUT, ':utf8';
    
    my $utf8 = 'Привет Москва';
    my @encodings = qw/UCS-2 UCS-2LE UCS-2BE UTF-16 UTF-32 ISO-8859-5 CP855 CP1251 KOI8-F KOI8-R KOI8-U/;
    
    say '
    :: Supported Cyrillic encoding
    ---------------------------------------------
    UTF-8       ', $utf8;
    
    for (@encodings) {
        printf "%-11s %s\n", $_, unpack('H*', encode($_,$utf8));
    }
    

    Output

    :: Supported Cyrillic encoding
    ---------------------------------------------
    UTF-8       Привет Москва
    UCS-2       041f044004380432043504420020041c043e0441043a04320430
    UCS-2LE     1f044004380432043504420420001c043e0441043a0432043004
    UCS-2BE     041f044004380432043504420020041c043e0441043a04320430
    UTF-16      feff041f044004380432043504420020041c043e0441043a04320430
    UTF-32      0000feff0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430
    ISO-8859-5  bfe0d8d2d5e220bcdee1dad2d0
    CP855       dde1b7eba8e520d3d6e3c6eba0
    CP1251      cff0e8e2e5f220cceef1eae2e0
    KOI8-F      f0d2c9d7c5d420edcfd3cbd7c1
    KOI8-R      f0d2c9d7c5d420edcfd3cbd7c1
    KOI8-U      f0d2c9d7c5d420edcfd3cbd7c1
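
The table above shows that encoding to UTF-16 or UTF-32 without an explicit byte order prepends a byte order mark (feff), while the BE and LE variants do not. A minimal sketch for the single character "д", with a round trip through CP1251 added for illustration; the variable names are hypothetical:

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

my $char = 'д';    # U+0434

# Without an explicit byte order, encode() emits a BOM followed by
# big-endian code units.
my $utf16   = unpack('H*', encode('UTF-16',   $char));
# With an explicit byte order, no BOM is written.
my $utf16be = unpack('H*', encode('UTF-16BE', $char));
my $utf16le = unpack('H*', encode('UTF-16LE', $char));

# Round trip through a legacy single-byte encoding.
my $cp1251_bytes = encode('CP1251', $char);
my $round_trip   = decode('CP1251', $cp1251_bytes);

printf "%-9s %s\n", 'UTF-16',   $utf16;
printf "%-9s %s\n", 'UTF-16BE', $utf16be;
printf "%-9s %s\n", 'UTF-16LE', $utf16le;
printf "%-9s %s\n", 'CP1251',   unpack('H*', $cp1251_bytes);
printf "%-9s %s\n", 'Same?',    ($round_trip eq $char ? 'yes' : 'no');
```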
    

    Documentation: Encode::Supported
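
To check whether a particular encoding is available in the local Perl installation before relying on it, something like the following could be used. It is only a sketch built on Encode's find_encoding, which returns undef for an unknown name; "no-such-encoding" is a made-up name included to show the negative case.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up each name; find_encoding() resolves aliases and returns an
# encoding object, or undef if the name is not known.
for my $name (qw(UCS-2BE UTF-16 ISO-8859-5 CP1251 KOI8-R no-such-encoding)) {
    my $enc = find_encoding($name);
    printf "%-17s %s\n", $name,
        defined $enc ? 'available as ' . $enc->name : 'not available';
}
```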