Perl internal representation of unicode string

I am working on a Perl + Mojolicious web application. My front end sends a POST request whose "a" parameter contains accented characters ("été"), encoded as UTF-8, as I can see in the Chrome network tab. On the server side, however, the script seems to decode that parameter with a charset I did not expect. I wrote the following script to reproduce the case.

use utf8; # script encoded in UTF-8 without BOM
use Mojolicious::Lite;
use Data::HexDump;
use Devel::Peek;
require Mojolicious;
say "perl $^V, Mojolicious: v", Mojolicious->VERSION, ", ", `chcp`;

post '/' => sub {
    my $self = shift;
    my $params = $self->req->params->to_hash;
    # dump the parameter as the server sees it
    app->log->debug("received data:\n", HexDump( $params->{a} ) );
    Dump( $params->{a} );
    $self->render( text => "ok for '$params->{a}'" );
};

if (my $pid = fork()) {
    use Mojo::UserAgent;
    my $t = Mojo::UserAgent->new;
    # simulate front-end query
    my $tx = $t->post('' =>
        { 'Content-Type' => 'application/x-www-form-urlencoded; charset=UTF-8' },
        form => { a => 'été' }
    );
    # dump the raw response body
    my $res = $tx->res->body;
    say "result:\n", HexDump($res);
    Dump( $res );
    kill 'SIGKILL', $pid;
}

app->start(qw(daemon --listen http://*:3042 ));

The output of this script was:

perl v5.20.1, Mojolicious: v6.05, Page de codes active : 850

[Tue May 26 12:31:15 2015] [info] Listening at "http://*:3042"
Server available at
[Tue May 26 12:31:16 2015] [debug] Your secret passphrase needs to be changed
[Tue May 26 12:31:16 2015] [debug] POST "/"
[Tue May 26 12:31:16 2015] [debug] Routing to a callback
[Tue May 26 12:31:16 2015] [debug] received data:

          00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F  0123456789ABCDEF

00000000  E9 74 E9                                           .t.

SV = PVMG(0x5a7a198) at 0x4dce730
  REFCNT = 1
  IV = 0
  NV = 0
  PV = 0x5b62c48 "\303\251t\303\251"\0 [UTF8 "\x{e9}t\x{e9}"]
  CUR = 5
  LEN = 10
[Tue May 26 12:31:16 2015] [debug] 200 OK (0.005052s, 197.941/s)
          00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F  0123456789ABCDEF

00000000  6F 6B 20 66 6F 72 20 27 - C3 A9 74 C3 A9 27        ok for '..t..'

SV = PV(0x41a73e8) at 0x4927070
  REFCNT = 1
  PV = 0x5aa1328 "ok for '\303\251t\303\251'"\0
  CUR = 14
  LEN = 16

So we can see that the server receives the "a" parameter as a string flagged UTF-8 that contains "\x{e9}t\x{e9}".

I was expecting "été" with the hex bytes "C3 A9 74 C3 A9".

What is wrong?


  • update: There is nothing wrong with your program. You are getting été just as you wanted; it is simply dumped as the Perl Unicode string "\xE9t\xE9", which is the same thing. Logically, a Perl Unicode string is a sequence of Unicode code points (ordinals), not a sequence of UTF-8 bytes: the request data has been decoded from UTF-8 into code points, and UTF-8 is just one way to encode those code points. (The PV line of your Devel::Peek dump does show the bytes C3 A9 74 C3 A9, but that is Perl's internal buffer, not what your code sees.) é is the ordinal 233; check the Wikipedia link below (also updated program).

    Um, été is only C3 A9 74 C3 A9 in UTF-8; as numbers (ordinals), été is 233 116 233,

    which as a Perl Unicode string is \xE9t\xE9, since the number 233 is E9 in hex.
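To make the bytes-versus-ordinals point concrete, here is a minimal sketch using only the core Encode module. The variable names are my own, and the five bytes are typed in by hand to stand for what the browser sends; it is an illustration, not what Mojolicious does internally.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The five bytes that represent "été" in UTF-8 (hand-written stand-in
# for the request data).
my $wire_bytes = "\xC3\xA9\x74\xC3\xA9";

# Decoding turns the bytes into a character string of three code points.
my $chars    = decode('UTF-8', $wire_bytes);
my @ordinals = map { ord($_) } split //, $chars;

# Encoding goes the other way and reproduces the original bytes.
my $round_trip = encode('UTF-8', $chars);
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $round_trip;

print "characters: ", length($chars), "\n";   # characters: 3
print "ordinals:   @ordinals\n";              # ordinals:   233 116 233
print "utf-8 hex:  $hex\n";                   # utf-8 hex:  C3 A9 74 C3 A9
```

So if you want to see C3 A9 74 C3 A9 in a hex dump, encode the string to UTF-8 first and dump the result; dumping the decoded string shows the ordinals instead.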

    update: previously I created the UTF-8 file 2 with an editor; here it is created with Perl. You can see it has the bytes you expect, and dd shows the difference between reading it as UTF-8 and reading it raw.

    $ perl -CS -e " print chr(233), chr(116), chr(233) " >2
    $ od -tx1 2
    0000000 c3 a9 74 c3 a9
    $ type 2
    $ perl -MData::Dump -MPath::Tiny -e " dd ( path(2)->slurp_raw ) "
    $ perl -MData::Dump -MPath::Tiny -e " dd ( path(2)->slurp_utf8 ) "
    $ perl -MData::Dump -MPath::Tiny -e " dd( map { [ $_, ord$_ ] } split //, path(2)->slurp_utf8 ) "
    (["\xE9", 233], ["t", 116], ["\xE9", 233])
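As for the internal representation that Devel::Peek shows, the following sketch (my own variable names, core Perl only) suggests why it should not matter to your code. The same three characters can be held in either of two internal forms, and they still compare equal and have the same length. `bytes::length` is used here only to peek at the size of the internal buffer.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

# Same three characters, two internal representations.
my $compact  = "\xE9t\xE9";      # one byte per character, UTF8 flag off
my $upgraded = "\xE9t\xE9";
utf8::upgrade($upgraded);        # internal UTF-8 buffer, UTF8 flag on

my $equal         = $compact eq $upgraded ? 1 : 0;
my $flag_compact  = utf8::is_utf8($compact)  ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;
my $buf_compact   = bytes::length($compact);
my $buf_upgraded  = bytes::length($upgraded);

print "equal: $equal\n";                                        # equal: 1
print "length: ", length($compact), " vs ", length($upgraded), "\n";  # length: 3 vs 3
print "UTF8 flag: $flag_compact vs $flag_upgraded\n";           # UTF8 flag: 0 vs 1
print "internal bytes: $buf_compact vs $buf_upgraded\n";        # internal bytes: 3 vs 5
```

The 5 internal bytes of the upgraded string correspond to the CUR = 5 in the dump from the question; the string itself is still three characters long.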