Suppose I have (in Bash):
txt="На берегу пустынных волн
Стоял он, дум великих полн,
И вдаль глядел."
If I pipe this to Perl, I can print it with no problem:
$ echo "$txt" | perl -lnE 'say "$_"'
На берегу пустынных волн
Стоял он, дум великих полн,
И вдаль глядел.
But I am having issues with various regexes on this text. Suppose I add the new Fancy Word Boundaries:
$ echo "$txt" | perl -lnE 'while (/\b{wb}(.+?)\b{wb}/g) { print "\"$1\"" }'
"–"
"ù"
"–"
"∞"
" "
"–"
"±"
"–µ—"
"Ä"
...
# junk characters...
The word boundaries are not working and the input characters are altered.
(If I change the regex to /\b{wb}(.+)\b{wb}/g the output is the same as the first. The (.+) consumes the entire line.)
I can fix these issues with the addition of the -CASD command line switch and the fancy word boundaries work as designed:
$ echo "$txt" | perl -CSAD -lnE 'while (/\b{wb}(.+?)\b{wb}/g) { print "\"$1\"" }'
"На"
" "
"берегу"
" "
"пустынных"
" "
"волн"
"Стоял"
" "
"он"
","
" "
"дум"
" "
"великих"
" "
"полн"
","
"И"
" "
"вдаль"
" "
"глядел"
"."
The question: The description of the -CASD switches in perlrun seems to imply that the Unicode features they enable apply to the stdin and stdout streams. There is no mention of any internal differences that would change a regex. Since I can read and print Unicode in the first case, why does adding -CASD change the regex?
$ perl -v
This is perl 5, version 28, subversion 0 (v5.28.0) built for darwin-thread-multi-2level
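In the first case, you aren't reading and printing Unicode, you're reading and printing UTF-8. Without -C, Perl treats these strings as sequences of bytes (octets), not characters, so the regex engine can find word boundaries in the middle of a multibyte sequence, and printing each match splits the encoded characters apart. The plain print only appears to work because the bytes pass through Perl unchanged. With -CSAD the input is decoded into characters before the regex sees it, so the boundaries fall between real words. See perlunicode for details.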