I'm trying to run a simple test whereby I want to have differently formatted binary strings and print them out. In fact, I'm trying to investigate a problem whereby sprintf cannot deal with a wide-character string passed in for the placeholder %s.
In this case, the binary string shall just contain the Cyrillic "д" (because it's above ISO-8859-1).
The code below works when I use the character directly in the source.
But nothing that passes through pack works.
Presumably something has to be done to $ch, but how?
The code:
#!/usr/bin/perl
use utf8; # Meaning "This lexical scope (i.e. file) contains utf8"
# https://perldoc.perl.org/open.html
use open qw(:std :encoding(UTF-8));
sub showme {
    my ($name,$ch) = @_;
    print "-------\n";
    print "This is test: $name\n";
    my $ord = ord($ch); # ordinal computed outside of "use bytes"; actually should yield the unicode codepoint
    {
        # https://perldoc.perl.org/bytes.html
        use bytes;
        my $mark = (utf8::is_utf8($ch) ? "yes" : "no");
        my $txt = sprintf("Received string of length: %i byte, contents: %vd, ordinal x%04X, utf-8: %s\n", length($ch), $ch, $ord, $mark);
        print $txt,"\n";
    }
    print $ch, "\n";
    print "Combine: $ch\n";
    print "Concat: " . $ch . "\n";
    print "Sprintf: " . sprintf("%s",$ch) . "\n";
    print "-------\n";
}
showme("Cryillic direct" , "д");
showme("Cyrillic UTF-8" , pack("HH","D0","B4")); # UTF-8 of д is D0B4
showme("Cyrillic UCS-2" , pack("HH","04","34")); # UCS-2 of д is 0434
Current output:
Looks good
-------
This is test: Cryillic direct
Received string of length: 2 byte, contents: 208.180, ordinal x0434, utf-8: yes
д
Combine: д
Concat: д
Sprintf: д
-------
That's a no. Where does the 176 come from??
-------
This is test: Cyrillic UTF-8
Received string of length: 2 byte, contents: 208.176, ordinal x00D0, utf-8: no
а
Combine: а
Concat: а
Sprintf: а
-------
This is even worse.
-------
This is test: Cyrillic UCS-2
Received string of length: 2 byte, contents: 0.48, ordinal x0000, utf-8: no
0
Combine: 0
Concat: 0
Sprintf: 0
-------
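On where the 176 comes from: as far as I can tell, this is down to the pack template rather than to sprintf. An "H" with no repeat count consumes only one hex digit of its argument and pads the low nybble with zero, so "HH" applied to "D0" and "B4" yields the bytes D0 B0, and D0 B0 happens to be the UTF-8 encoding of U+0430 ("а"). The same reasoning gives 00 30 for the UCS-2 test. A minimal sketch to check this in isolation (the variable names are mine, not from the question):

```perl
use strict;
use warnings;

# "H" without a repeat count takes only ONE hex digit per argument;
# the missing low nybble is filled with zero.
my $one_digit  = pack("HH",   "D0", "B4");  # bytes D0 B0
my $two_digits = pack("H2H2", "D0", "B4");  # bytes D0 B4
my $star       = pack("H*",   "D0B4");      # bytes D0 B4

printf "HH:   %vd\n", $one_digit;   # HH:   208.176
printf "H2H2: %vd\n", $two_digits;  # H2H2: 208.180
printf "H*:   %vd\n", $star;        # H*:   208.180
```

Either "H2" per argument or a single "H*" over the whole hex string produces the intended bytes; the result is still a byte string, not yet a decoded character string.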
Please see whether the following demonstration code is of any help.
use strict;
use warnings;
use feature 'say';
use utf8; # https://perldoc.perl.org/utf8.html
use Encode; # https://perldoc.perl.org/Encode.html
my $str;
my $utf8 = 'Привет Москва';
my $ucs2le = '1f044004380432043504420420001c043e0441043a0432043004'; # Little Endian
my $ucs2be = '041f044004380432043504420020041c043e0441043a04320430'; # Big Endian
my $utf16 = '041f044004380432043504420020041c043e0441043a04320430';
my $utf32 = '0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430';
# https://perldoc.perl.org/functions/binmode.html
binmode STDOUT, ':utf8';
# https://perldoc.perl.org/feature.html#The-'say'-feature
say 'UTF-8: ' . $utf8;
# https://perldoc.perl.org/Encode.html#THE-PERL-ENCODING-API
$str = pack('H*',$ucs2be);
say 'UCS-2BE: ' . decode('UCS-2BE',$str);
$str = pack('H*',$ucs2le);
say 'UCS-2LE: ' . decode('UCS-2LE',$str);
$str = pack('H*',$utf16);
say 'UTF-16: '. decode('UTF16',$str);
$str = pack('H*',$utf32);
say 'UTF-32: ' . decode('UTF32',$str);
Output
UTF-8: Привет Москва
UCS-2BE: Привет Москва
UCS-2LE: Привет Москва
UTF-16: Привет Москва
UTF-32: Привет Москва
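Applied to the single character from the question, the same two steps (pack with H* to get the raw bytes, then decode to get characters) might look like the sketch below. It assumes big-endian byte order for the UCS-2 case, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

# pack only builds raw bytes; decode turns them into a character string.
my $from_utf8 = decode('UTF-8',   pack('H*', 'D0B4'));
my $from_ucs2 = decode('UCS-2BE', pack('H*', '0434'));

for my $ch ($from_utf8, $from_ucs2) {
    # Both are now the one-character string U+0434.
    printf "length %d, ordinal x%04X, sprintf: %s\n",
        length($ch), ord($ch), $ch;
}
```

With the strings decoded this way, length and ord report characters rather than bytes, and sprintf("%s", ...) receives an ordinary character string.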
Supported Cyrillic encodings
use strict;
use warnings;
use feature 'say';
use Encode;
use utf8;
binmode STDOUT, ':utf8';
my $utf8 = 'Привет Москва';
my @encodings = qw/UCS-2 UCS-2LE UCS-2BE UTF-16 UTF-32 ISO-8859-5 CP855 CP1251 KOI8-F KOI8-R KOI8-U/;
say '
:: Supported Cyrillic encoding
---------------------------------------------
UTF-8 ', $utf8;
for (@encodings) {
    printf "%-11s %s\n", $_, unpack('H*', encode($_,$utf8));
}
Output
:: Supported Cyrillic encoding
---------------------------------------------
UTF-8 Привет Москва
UCS-2 041f044004380432043504420020041c043e0441043a04320430
UCS-2LE 1f044004380432043504420420001c043e0441043a0432043004
UCS-2BE 041f044004380432043504420020041c043e0441043a04320430
UTF-16 feff041f044004380432043504420020041c043e0441043a04320430
UTF-32 0000feff0000041f0000044000000438000004320000043500000442000000200000041c0000043e000004410000043a0000043200000430
ISO-8859-5 bfe0d8d2d5e220bcdee1dad2d0
CP855 dde1b7eba8e520d3d6e3c6eba0
CP1251 cff0e8e2e5f220cceef1eae2e0
KOI8-F f0d2c9d7c5d420edcfd3cbd7c1
KOI8-R f0d2c9d7c5d420edcfd3cbd7c1
KOI8-U f0d2c9d7c5d420edcfd3cbd7c1
Documentation: Encode::Supported
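To see which of these names the local Encode installation actually recognises, one could ask Encode itself. A small sketch, assuming only the documented find_encoding and Encode->encodings(":all") calls; the names checked are simply some of those used above, and @wanted and @all are my own variable names:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Names taken from the list above; find_encoding returns undef for unknown names.
my @wanted = qw/UCS-2LE UCS-2BE UTF-16 UTF-32 ISO-8859-5 CP855 CP1251 KOI8-R KOI8-U/;

for my $name (@wanted) {
    my $enc = find_encoding($name);
    printf "%-11s %s\n", $name,
        $enc ? 'canonical name: ' . $enc->name : 'not available';
}

# Every encoding known to this installation, including those not yet loaded.
my @all = Encode->encodings(':all');
printf "%d encodings available in total\n", scalar @all;
```

The exact list depends on the installed Encode version, so the output is not reproduced here.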