Perl one-liner: can print Unicode input, but a regex with the fancy word boundaries does not work


Suppose I have (in Bash):

txt="На берегу пустынных волн
Стоял он, дум великих полн,
И вдаль глядел."

If I pipe this to Perl, I can print no problem:

$ echo "$txt" | perl -lnE 'say "$_"'
На берегу пустынных волн
Стоял он, дум великих полн,
И вдаль глядел.

But I am having issues with various regex on this text. Suppose I add the new Fancy Word Boundaries:

$ echo "$txt" | perl -lnE 'while (/\b{wb}(.+?)\b{wb}/g) { print "\"$1\"" }'
"–"
"ù"
"–"
"∞"
" "
"–"
"±"
"–µ—"
"Ä"
...
# junk characters...

The word boundaries are not working and the input characters are altered.

(If I change the regex to /\b{wb}(.+)\b{wb}/g the output is the same as the first. The (.+) consumes the entire line.)

I can fix these issues with the addition of the -CASD command line switch and the fancy word boundaries work as designed:

$ echo "$txt" | perl -CSAD  -lnE 'while (/\b{wb}(.+?)\b{wb}/g) { print "\"$1\"" }'
"На"
" "
"берегу"
" "
"пустынных"
" "
"волн"
"Стоял"
" "
"он"
","
" "
"дум"
" "
"великих"
" "
"полн"
","
"И"
" "
"вдаль"
" "
"глядел"
"."

The question: The description of the -CASD switches in perlrun seems to imply that the Unicode features they enable apply only to the standard input and output streams. There is no mention of any internal differences that would change a regex. Since I can read and print Unicode in the first case, why does adding -CASD change the regex?

$ perl -v
This is perl 5, version 28, subversion 0 (v5.28.0) built for darwin-thread-multi-2level

Solution

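  • In the first case, you aren't reading and printing Unicode, you're reading and printing UTF-8. Without -C, Perl treats the input as a string of bytes (octets), not characters, so the regex can find a word boundary in the middle of a multibyte sequence. Printing only appears to work because the same bytes go out unchanged and the terminal interprets them as UTF-8. The -CSD switches decode the input into characters before the regex sees it, and that is why the match changes. See perlunicode for details.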