perl utf-8 utf-16 unicode-string unicode-escapes

How to use Perl pack to convert UTF-16 surrogate pairs to UTF-8?


I have input strings in which some characters are written as UTF-16 code units escaped with '\u'. I am trying to convert all of these strings to UTF-8 in Perl. For example, the string 'Alice & Bob & Carol' might appear in the input as:

'Alice \u0026 Bob \u0026 Carol'

To do the conversion, I was using:

$str =~ s/\\u([A-Fa-f0-9]{4})/pack("U", hex($1))/eg;

This worked fine until I got input strings that contained UTF-16 surrogate pairs, like:

'Alice \ud83d\ude06 Bob'

How do I modify the above code that uses pack to work with UTF-16 surrogate pairs? I would really like a solution that just uses pack without having to use any additional libraries (JSON::XS, Encode, etc.).


Solution

  • pack/unpack have no knowledge of UTF-16 text, only UTF-8 (and UTF-EBCDIC). Since you don't want to use a module, you have to decode the surrogate pairs manually.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use open qw/:locale/;
    use feature qw/say/;
    
    my $str = 'Alice \ud83d\ude06 Bob \u0026 Carol';
    
    # Convert surrogate pairs encoded as two \uXXXX sequences
    # Only match valid surrogate pairs so adjacent non-pairs aren't counted as one
    $str =~ s/\\u((?i)D[89AB]\p{AHex}{2}) # High surrogate in range 0xD800–0xDBFF
              \\u((?i)D[CDEF]\p{AHex}{2}) #  Low surrogate in range 0xDC00–0xDFFF
             /chr( ((hex($1) - 0xD800) * 0x400) + (hex($2) - 0xDC00) + 0x10000 )/xge;
    # Convert single \uXXXX sequences
    $str =~ s/\\u(\p{AHex}{4})/chr hex $1/ge;
    
    say $str;
    

    This outputs:

    Alice 😆 Bob & Carol
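
    Since the question asked specifically about pack, here is a rough sketch of the same idea with the surrogate arithmetic pulled into a helper and pack("U") used in place of chr. The names surrogates_to_char and decode_u_escapes are made up for illustration, and the sketch assumes the input only contains well-formed pairs (a lone surrogate would still fall through to the second substitution). The comments walk through the arithmetic for \ud83d\ude06.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: combine a high/low surrogate pair (given as hex
# strings) into one code point, then let pack("U") build the character.
sub surrogates_to_char {
    my ($hi, $lo) = map { hex } @_;
    # Worked example for \ud83d\ude06:
    #   0xD83D - 0xD800 = 0x03D   (upper 10 bits)
    #   0xDE06 - 0xDC00 = 0x206   (lower 10 bits)
    #   0x10000 + (0x03D << 10) + 0x206 = 0x1F606
    my $cp = 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
    return pack("U", $cp);
}

# Hypothetical wrapper around the two substitutions.
sub decode_u_escapes {
    my ($str) = @_;
    # Pairs first, so the single-escape pass never sees half of a pair.
    $str =~ s/\\u([Dd][89ABab][0-9A-Fa-f]{2})\\u([Dd][C-Fc-f][0-9A-Fa-f]{2})/surrogates_to_char($1, $2)/ge;
    # Remaining single \uXXXX escapes, as in the question's original code.
    $str =~ s/\\u([0-9A-Fa-f]{4})/pack("U", hex($1))/ge;
    return $str;
}

my $chars = decode_u_escapes('Alice \ud83d\ude06 Bob \u0026 Carol');

# $chars holds characters, not bytes. utf8::encode is built in (no module
# to load) and converts a copy into raw UTF-8 bytes if that is needed.
my $bytes = $chars;
utf8::encode($bytes);

printf "U+%04X\n", ord substr($chars, 6, 1);                        # U+1F606
print length($chars), " characters, ", length($bytes), " bytes\n";  # 19 characters, 22 bytes
```

    Whether you want $chars or $bytes depends on how you write the result out: with an encoding layer on the output handle (as with use open qw/:locale/ above) print the character string, and with a raw handle print the encoded bytes.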