Convert text to hexadecimal values


I'm trying to print the phrase "for sale" in Arabic ("عربي"), but my terminal displays it left to right, even though Arabic is written right to left. The word transliterates to "llbye", but the terminal shows "eybll" (ﻊﻴﺒﻠﻟ).

use strict;
use warnings;
use utf8;

binmode( STDOUT, ':utf8' );

use Encode qw< encode decode >;

my $str = 'ﻟﻠﺒﻴﻊ';    # "for sale"
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );

my $decoded = pack 'U0W*', map +ord, split //, $enc;

print "Original string : $str\n";     #  ل ل ب ي ع
print "Decoded string 1: $dec\n";      #  ل ل ب ي ع
print "Decoded string 2: $decoded\n"; #  ل ل ب ي ع
my $k = reverse($decoded);
print "Decode  reverse : $k\n";
print "0x$_" for unpack "H*", scalar reverse "$decoded\n";

On line 21, I try to print these characters as a hex dump so I can see them better, but I get:

Character in 'H' format wrapped in unpack at line 21.

Term[Perl]:# perl schreib.pl
Original string : ﻟﻠﺒﻴﻊ
Decoded string 1: ﻟﻠﺒﻴﻊ
Decoded string 2: ﻟﻠﺒﻴﻊ
Decode reverse : ﻊﻴﺒﻠﻟ

Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.

As the image shows, the first frame is what I copy and paste, and the terminal reverses it on its own. I have to use reverse to print it right to left, as in the second frame, which is how it should have looked when pasted.
How do I transform these characters into hexadecimal?


Solution

  • unpack H* expects a string of bytes (characters with value 00..FF), but you have a string of Unicode Code Points (characters with value 000000..10FFFF).

    You can use

    sprintf "%vX", $str
    

    which is effectively the same as

    join ".", map sprintf( "%X", ord( $_ ) ), split //, $str
    

    and

    join ".", map sprintf( "%X", $_ ), unpack "W*", $str
    

    All three work for any string (bytes, UCP, whatever).

    For $str, $dec and $decoded, the above produces

    FEDF.FEE0.FE92.FEF4.FECA
    

    For $enc, the above produces

    EF.BB.9F.EF.BB.A0.EF.BA.92.EF.BB.B4.EF.BB.8A
    

    (You may get something different since our files might not be the same.)
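
    If what you want is the plain hex dump that unpack "H*" was meant to give, one option is to encode the string to bytes before unpacking. This is a minimal sketch, assuming UTF-8 is the encoding you want to inspect. The string is written with \x{...} escapes (my choice, not from the question) so that the result does not depend on how the source file is saved.

    ```perl
    use strict;
    use warnings;
    use Encode qw( encode );

    # The five code points shown above, written as escapes so that the
    # result does not depend on the encoding of this source file.
    my $str = "\x{FEDF}\x{FEE0}\x{FE92}\x{FEF4}\x{FECA}";

    # unpack "H*" expects bytes, so turn the code points into UTF-8 bytes first.
    my $bytes = encode( 'UTF-8', $str );
    my $hex   = unpack "H*", $bytes;

    print "$hex\n";    # efbb9fefbba0efba92efbbb4efbb8a
    ```

    Because every character of $bytes is in the range 00..FF, no "wrapped" warning is emitted. The digits match the EF.BB.9F... list above, only in lower case and without separators.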


    With Unicode Code Points, we can use charnames (and/or Unicode::UCD) for more info.

    use charnames qw( :full );
    use feature qw( say );
    
    for my $ucp ( unpack "W*", $str ) {
       my $ch = chr( $ucp );
       if ( $ch =~ /(?[ \p{Print} - \p{Mark} ])/ ) {   # Not sure if good enough.
          printf "‹%s› ", $ch;
       } else {
          print "--- ";
       }
    
       printf "U+%X ", $ucp;
    
       say charnames::viacode( $ucp );
    }
    

    For $str, $dec and $decoded, the above produces

    ‹ﻟ› U+FEDF ARABIC LETTER LAM INITIAL FORM
    ‹ﻠ› U+FEE0 ARABIC LETTER LAM MEDIAL FORM
    ‹ﺒ› U+FE92 ARABIC LETTER BEH MEDIAL FORM
    ‹ﻴ› U+FEF4 ARABIC LETTER YEH MEDIAL FORM
    ‹ﻊ› U+FECA ARABIC LETTER AIN FINAL FORM
    

    Data::Dumper with local $Data::Dumper::Useqq = 1; will also produce ASCII-only output.
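
    A short sketch of the Data::Dumper approach, assuming the same five code points as above. Setting $Data::Dumper::Terse is my addition to keep the output short; it is not required.

    ```perl
    use strict;
    use warnings;
    use Data::Dumper;

    # Same five code points as above, written as escapes.
    my $str = "\x{FEDF}\x{FEE0}\x{FE92}\x{FEF4}\x{FECA}";

    my $dump = do {
        local $Data::Dumper::Useqq = 1;    # escape non-ASCII characters as \x{...}
        local $Data::Dumper::Terse = 1;    # assumption: drop the "$VAR1 = " prefix
        Dumper( $str );
    };

    # The dump is a double-quoted Perl string literal made of ASCII only,
    # so it prints safely on any terminal.
    print $dump;
    ```

    The printed literal can be pasted back into Perl source to recreate the string, which makes it handy for bug reports.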