regex, perl, utf-8, decimal, ncr

Perl: Convert (high) decimal NCR to UTF-8


I have this string (Decimal NCRs): &#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;

It represents the Japanese text 日本の鍼灸とは.

But I need (UTF-8): %E6%97%A5%E6%9C%AC%E3%81%AE%E9%8D%BC%E7%81%B8%E3%81%A8%E3%81%AF

For the first character, &#26085; becomes %E6%97%A5.

This site does it, but how do I get this in Perl? (If possible in a single regex like s/\&\#([0-9]+);/uc('%'.unpack("H2", pack("c", $1)))/eg;.)

http://www.endmemo.com/unicode/unicodeconverter.php

Also, I need to convert it back again, from UTF-8 to Decimal NCRs.

I've been breaking my head over this one for half a day now, any help is greatly appreciated!


Solution

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Test::More tests => 2;
    use Encode qw{ encode decode };

    my $in  = '&#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;'; # 日本の鍼灸とは
    my $out = '%E6%97%A5%E6%9C%AC%E3%81%AE%E9%8D%BC%E7%81%B8%E3%81%A8%E3%81%AF';

    # Decimal NCRs -> characters.
    (my $utf = $in) =~ s/&#([0-9]+);/chr $1/ge;

    # Characters -> UTF-8 bytes -> one zero-padded %XX escape per byte.
    my $r = join q(), map { sprintf '%%%02X', ord } split //, encode('UTF-8', $utf);
    is($r, $out);

    # %XX escapes -> bytes -> characters -> decimal NCRs.
    (my $s = $r) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    $s = decode('UTF-8', $s);
    $s = join q(), map '&#' . ord($_) . ';', split //, $s;
    is($s, $in);
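
The question also asks whether this can be done in a single regex. A minimal sketch of one way to do that, with one substitution per direction. The helper names `ncr_to_percent` and `percent_to_ncr` are made up for illustration and are not part of any module. The attempt in the question fails because `pack("c", ...)` holds only one signed byte, whereas each of these code points needs three UTF-8 bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Encode qw{ encode decode };

# Hypothetical helper: decimal NCRs -> percent-encoded UTF-8.
sub ncr_to_percent {
    my ($text) = @_;
    # Code point -> character -> UTF-8 bytes -> one %XX escape per byte.
    $text =~ s{&#([0-9]+);}{
        join q(), map { sprintf '%%%02X', $_ }
            unpack 'C*', encode('UTF-8', chr($1))
    }ge;
    return $text;
}

# Hypothetical helper: percent-encoded UTF-8 -> decimal NCRs.
sub percent_to_ncr {
    my ($text) = @_;
    # Match a whole run of escapes so multi-byte sequences stay together.
    $text =~ s{((?:%[0-9A-Fa-f]{2})+)}{
        (my $hex = $1) =~ tr/%//d;                     # drop the percent signs
        my $chars = decode('UTF-8', pack('H*', $hex)); # hex -> bytes -> characters
        join q(), map { sprintf '&#%d;', ord($_) } split //, $chars
    }ge;
    return $text;
}

print ncr_to_percent('&#26085;'), "\n";   # prints %E6%97%A5
print percent_to_ncr('%E6%97%A5'), "\n";  # prints &#26085;
```

Text outside the matched patterns is left untouched in both directions. Note that `percent_to_ncr` turns every escape into an NCR, so an ASCII escape such as `%20` comes back as `&#32;`. Because `decode` is called without a CHECK argument, malformed byte sequences are replaced with U+FFFD rather than raising an error.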