How to read unbuffered UTF-8 in Perl


I'm trying to read UTF-8 input in Perl in an unbuffered way (i.e. as soon as data is available, it should be returned):

die if !binmode STDIN, ':unix:utf8';
my $i;
my $buf;
while ($i = read(STDIN, $buf, 8192)) {
  print "$i\n";
}

However, it doesn't work if a UTF-8 character in the input is split across two reads:

$ perl -e '$|=1;print"\xc3";sleep 1;print"\xa1";sleep 1;print"AB"' | perl t.pl

This should print 1 and then 2, but it prints 3 instead: the buffering withholds the first character even after all of its bytes have arrived.

Is there an easy solution for this in Perl? Or maybe in another scripting language for Unix?


Solution

  • First, you need to switch from read to sysread. read keeps reading until it has the requested number of characters (or reaches EOF), while sysread returns as soon as any data is available.

    But returning data as soon as it arrives means the input may end with an incomplete UTF-8 character, so you'll have to decode only the characters that have been fully received and buffer the rest.

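    use Encode qw( decode );

    # Decodes the complete characters at the start of $_[0] and removes
    # them from it (FB_QUIET leaves the undecoded bytes in place).
    # Returns undef if nothing could be decoded and the remaining bytes
    # are invalid rather than merely incomplete.
    sub decode_utf8_partial {
       my $s = decode('UTF-8', $_[0], Encode::FB_QUIET);
       return undef
          if !length($s) && $_[0] =~ /
             ^
             (?: [\x80-\xBF]        # Stray continuation byte
             |   [\xC0-\xDF].       # Full 2-byte sequence that failed to decode
             |   [\xE0-\xEF]..      # Full 3-byte sequence that failed to decode
             |   [\xF0-\xF7]...     # Full 4-byte sequence that failed to decode
             |   [\xF8-\xFF]        # Never a valid lead byte
             )
          /xs;

       return $s;
    }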
    
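    # Read raw bytes from the handle; decoding is done by hand below.
    binmode($fh);

    # Start with an empty string so that length($buf) is defined on the first read.
    my $buf = '';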
    while (1) {
       my $rv = sysread($fh, $buf, 64*1024, length($buf));
       die $! if !defined($rv);
       last if !$rv;
    
       while (1) {
          # Leaves undecoded part in $buf    
          my $s = decode_utf8_partial($buf);
          die "Bad UTF-8" if !defined($s);
          last if !length($s);
    
          # ... do something with $s ...
       }
    }
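
To see the partial decoding at work without pipes or sleeps, here is a minimal sketch that feeds the three chunks from the question's test command into a small helper. The helper name feed_bytes and its interface are hypothetical, made up for this illustration. Unlike the loop above, it checks the leftover bytes on every call instead of decoding in an inner loop, which is a variation on the answer's approach and not the answer's own code.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper for illustration: append $chunk to the pending bytes in
# $$pending_ref and return whatever complete characters can now be decoded.
sub feed_bytes {
    my ($pending_ref, $chunk) = @_;
    $$pending_ref .= $chunk;

    # FB_QUIET decodes up to the first problem and leaves the rest in place.
    my $chars = decode('UTF-8', $$pending_ref, Encode::FB_QUIET);

    # Leftover bytes that already form a full-length sequence can never
    # become valid; a shorter leftover may just be waiting for more input.
    die "Bad UTF-8\n"
        if $$pending_ref =~ /
            ^
            (?: [\x80-\xBF]
            |   [\xC0-\xDF].
            |   [\xE0-\xEF]..
            |   [\xF0-\xF7]...
            |   [\xF8-\xFF]
            )
        /xs;

    return $chars;
}

# Same byte stream as the test command in the question, without the sleeps.
my $pending = '';
my @counts;
for my $chunk ("\xc3", "\xa1", "AB") {
    my $chars = feed_bytes(\$pending, $chunk);
    push @counts, length($chars);
}
print "@counts\n";   # 0 1 2
```

The first chunk yields no characters because "\xc3" is only the first half of a two-byte sequence; a real reader would skip such empty results, which gives the 1 and then 2 that the question expects. At end of input it is probably also worth checking that the pending buffer is empty, since leftover bytes there mean the stream ended in the middle of a character.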