I'm trying to read UTF-8 input in Perl in an unbuffered way (i.e. as soon as data is available, it should be returned):
die if !binmode STDIN, ':unix:utf8';
my $i;
my $buf;
while ($i = read(STDIN, $buf, 8192)) {
    print "$i\n";
}
However, it doesn't work if a UTF-8 character in the input is split across two writes:
$ perl -e '$|=1;print"\xc3";sleep 1;print"\xa1";sleep 1;print"AB"' | perl t.pl
This should print 1 and then 2, but it prints 3, so the buffering is withholding the first character even after it became available.
Is there an easy solution for this in Perl? Or maybe in another scripting language for Unix?
First, you need to change from read to sysread. read reads until it has the requested number of characters (or reaches end of file), while sysread returns as soon as data are available.
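The difference is easy to observe on a pipe whose writer pauses between writes. The following is a minimal sketch, not part of the original answer: the helper name sysread_sizes is hypothetical, and the exact split between reads depends on timing, so no particular output is guaranteed.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: returns the byte counts delivered by successive
# sysread calls on a pipe whose writer pauses between two writes.
sub sysread_sizes {
    pipe(my $reader, my $writer) or die "pipe: $!";
    my $pid = fork();
    die "fork: $!" if !defined $pid;
    if ($pid == 0) {                 # child: the writer
        close $reader;
        syswrite($writer, "A");
        sleep 1;                     # the reader gets "A" before "BC" exists
        syswrite($writer, "BC");
        close $writer;
        POSIX::_exit(0);             # skip END blocks and buffered output
    }
    close $writer;                   # parent: the reader
    my @sizes;
    while (1) {
        my $rv = sysread($reader, my $chunk, 8192);
        die "sysread: $!" if !defined $rv;
        last if !$rv;                # 0 means end of file
        push @sizes, $rv;
    }
    waitpid($pid, 0);
    return @sizes;
}

print join(',', sysread_sizes()), "\n";
```

With the one-second pause, sysread will normally return the first byte on its own instead of waiting for 8192 bytes or end of file, which is the behaviour the question asks for.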
But returning data as soon as it arrives means you might have an incomplete UTF-8 character at the end of the buffer, so you'll have to decode only the characters that have been fully received and keep the rest buffered until more bytes arrive.
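use Encode qw( decode );

# Decodes the complete characters at the start of $_[0] and removes them
# from it; an incomplete trailing sequence stays in the caller's buffer.
# Returns undef if the buffer starts with bytes that can never become valid.
sub decode_utf8_partial {
    my $s = decode('UTF-8', $_[0], Encode::FB_QUIET);
    return undef
        if !length($s) && $_[0] =~ /
            ^
            (?: [\x80-\xBF]
            |   [\xC0-\xDF].
            |   [\xE0-\xEF]..
            |   [\xF0-\xF7]...
            |   [\xF8-\xFF]
            )
        /xs;
    return $s;
}

my $fh = \*STDIN;   # the question reads from STDIN
binmode($fh);       # read raw bytes; decoding is done by hand below
my $buf = '';
while (1) {
    my $rv = sysread($fh, $buf, 64*1024, length($buf));
    die $! if !defined($rv);
    last if !$rv;
    while (1) {
        # Leaves undecoded part in $buf
        my $s = decode_utf8_partial($buf);
        die "Bad UTF-8" if !defined($s);
        last if !length($s);
        # ... do something with $s ...
    }
}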
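To check the buffering logic without a pipe, the chunks from the question's test command can be fed in by hand. This is a minimal sketch: char_counts is a hypothetical helper that treats each argument as the result of one sysread, and decode_utf8_partial is copied from above so the example runs on its own.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Copy of decode_utf8_partial from above, so this sketch is self-contained.
sub decode_utf8_partial {
    my $s = decode('UTF-8', $_[0], Encode::FB_QUIET);
    return undef
        if !length($s) && $_[0] =~ /
            ^
            (?: [\x80-\xBF]
            |   [\xC0-\xDF].
            |   [\xE0-\xEF]..
            |   [\xF0-\xF7]...
            |   [\xF8-\xFF]
            )
        /xs;
    return $s;
}

# Hypothetical helper: each argument stands for the bytes returned by one
# sysread. Returns the character count of every non-empty decoded piece.
sub char_counts {
    my @chunks = @_;
    my $buf = '';
    my @counts;
    for my $chunk (@chunks) {
        $buf .= $chunk;
        while (1) {
            my $s = decode_utf8_partial($buf);   # leaves the tail in $buf
            die "Bad UTF-8" if !defined($s);
            last if !length($s);
            push @counts, length($s);
        }
    }
    return @counts;
}

# The same three writes as in the question: "\xc3", then "\xa1", then "AB".
print join(',', char_counts("\xC3", "\xA1", "AB")), "\n";   # prints 1,2
```

The lone "\xC3" produces no output because it stays in the buffer; once "\xA1" arrives the pair decodes to one character, and "AB" then decodes to two, which matches the 1 and 2 the question expects.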