I was somewhat surprised to observe that the following code
# comment
say 1;
# comment
say 2;
# comment say 3;
# comment say 4;
prints 1
, 2
, 3
, and 4
.
Here are the relevant characters after "# comment":
say "
".uninames.raku;
# OUTPUT: «("PARAGRAPH SEPARATOR", "LINE SEPARATOR", "<control-000B>", "<control-000C>").Seq»
Note that many/all of these characters are invisible in most fonts. At least with my editor, none cause the following text to be printed on a new line. And at least one (<control-000C>
, aka Form Feed
, sometimes printed as ^L
) is in fairly wide use in Vim/Emacs as a section separator.
This raises a few questions:
(I realize #4 is not technically a question, but I feel it needed to be said).
Raku's syntax is defined as a Raku grammar. The rule for parsing such a comment is:
token comment:sym<#> {
'#' {} \N*
}
That is, it eats everything after the #
that is not a newline character. As with all built-in character classes in Raku, \n
and its negation are Unicode-aware. The language design docs state:
\n matches a logical (platform independent) newline, not just \x0a. See TR18 section 1.6 for a list of logical newlines.
Which is a reference to the Unicode standard for regular expressions.
I somewhat doubt there was ever a specific language design discussion along the lines of "let's enable all the kinds of newlines in Unicode, it'll be cool!" Rather, the decisions were that Raku should follow the Unicode regex technical report, and that Raku syntax would be defined in terms of a Raku grammar and thus make use of the Unicode-aware character classes. That a range of different newline characters are supported is a consequence of consistently following those principles.