Tags: utf-8, character-encoding, perl, perl-io

Non-determinism in encoding when using open() with scalar and I/O layers in Perl


I have been fighting a bug in my Perl program for several hours now. I am not sure whether I am doing something wrong or the interpreter is, but the code behaves non-deterministically where, in my opinion, it should be deterministic. It exhibits the same behavior on the ancient Debian Lenny (Perl 5.10.0) and on a server just upgraded to Debian Wheezy (Perl 5.14.2). I have boiled it down to this piece of Perl code:

#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
my $c = "";
open C, ">:utf8", \$c;
print C "š";
close C;
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";

It starts the Perl 5 interpreter in strict mode with warnings enabled, treats string literals as character strings (as opposed to byte strings), and puts the standard streams under the :utf8 layer (Perl's internal notion of UTF-8, which is pretty close to the real thing; changing to full UTF-8 makes no difference). It then opens a file handle to an “in-memory file” (a scalar variable), prints a single character that occupies two bytes in UTF-8 into it, and examines the variable after the handle is closed.

At that point the scalar variable always has its UTF8 flag off. However, it sometimes seems to contain a byte string (which utf8::decode() converts to a character string) and sometimes a character string that just needs its UTF8 flag flipped on (Encode::_utf8_on()).

When I execute my code repeatedly (1000 times, via Bash), it prints Undecoded and Decoded with approximately the same frequency. When I change the string I write into the “file”, e.g. by adding a newline at its end, Undecoded disappears. When utf8::decode succeeds and I retry it on the same original string in a loop, it keeps succeeding within the same interpreter instance; if it fails, however, it keeps failing.

What is the explanation for the observed behavior? How can I use a file handle to a scalar variable together with character strings?

Bash playground:

for i in {1..1000}; do perl -we 'use strict; use utf8; binmode STDOUT, ":utf8"; binmode STDERR, ":utf8"; my $c = ""; open C, ">:utf8", \$c; print C "š"; close C; die "Does not happen\n" if utf8::is_utf8($c); print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";'; done | grep Undecoded | wc -l

For reference and to be absolutely sure, I also made a version with pedantic error handling – same results.

#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8" or die "Cannot binmode STDOUT\n";
binmode STDERR, ":utf8" or die "Cannot binmode STDERR\n";
my $c = "";
open C, ">:utf8", \$c or die "Cannot open: $!\n";
print C "š" or die "Cannot print: $!\n";
close C or die "Cannot close: $!\n";
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";

Solution

  • Examining $c in detail reveals that the problem has nothing to do with the content of $c or its internals: the scalar looks identical in both runs before the call, and the return value of utf8::decode accurately reports whether it decoded the string or not.

    $ for i in {1..2}; do
         perl -MDevel::Peek -we'
            use strict; use utf8;
            binmode STDOUT, ":utf8";
            binmode STDERR, ":utf8";
            my $c = "";
            open C, ">:utf8", \$c;
            print C "š";
            close C;
            die "Does not happen\n" if utf8::is_utf8($c);
            Dump($c);
            print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
            Dump($c)
         '
         echo
      done
    

    SV = PV(0x17c8470) at 0x17de990
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x17d7a40 "\305\241"
      CUR = 2
      LEN = 16
    Decoded
    SV = PV(0x17c8470) at 0x17de990
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x17d7a40 "\305\241" [UTF8 "\x{161}"]
      CUR = 2
      LEN = 16
    

    SV = PV(0x2d0fee0) at 0x2d26400
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x2d1f4b0 "\305\241"
      CUR = 2
      LEN = 16
    Undecoded
    SV = PV(0x2d0fee0) at 0x2d26400
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x2d1f4b0 "\305\241"
      CUR = 2
      LEN = 16
    

    This was a bug in utf8::decode, but it was fixed in 5.16.3 or earlier, probably 5.16.0 since it was still present in 5.14.2.

    A suitable workaround is to use Encode's decode_utf8 instead.
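
A minimal sketch of that workaround, assuming the same setup as in the question. The variable names ($bytes, $chars, $fh) and the use of a lexical file handle are illustrative choices, not part of the original code. It decodes the octets collected in the in-memory file into a new character string with Encode's decode_utf8 rather than decoding in place with utf8::decode.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # string literals below are character strings
use Encode qw(decode_utf8);

binmode STDOUT, ":utf8";

# Write one character through a :utf8 layer into an in-memory file.
my $bytes = "";
open my $fh, ">:utf8", \$bytes or die "Cannot open: $!\n";
print {$fh} "š" or die "Cannot print: $!\n";
close $fh or die "Cannot close: $!\n";

# $bytes holds the UTF-8 encoded octets (UTF8 flag off).
# decode_utf8 returns a new character string; called without a CHECK
# argument, as here, it leaves $bytes untouched.
my $chars = decode_utf8($bytes);

printf "octets: %d, characters: %d, round trip: %s\n",
    length($bytes), length($chars),
    ($chars eq "š" ? "ok" : "FAILED");
```

Keeping the encoded octets and the decoded characters in separate variables also makes it harder to decode the same data twice by accident.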