So we know that
// This doesn't affect anything
/*
This doesn't affect anything either
*/
/*
/* /* /*
This doesn't affect anything
*/
This does because comments aren't recursive
/* /*
This doesn't affect anything
*/ */
This throws an error because the second * / is unmatched since comments aren't recursive
I've heard that the reason they aren't recursive is that recursion would slow down the compiler, and I guess that makes sense. However, nowadays when I'm parsing C++ code in a higher-level language (say Python), I can simply use the regular expression
"\/[\/]+((?![\n])[\s\S])*\r*\n"
to match // single-line comments, and use
"\/\*((?!\*\/)[\s\S])*\*\/"
to match /* multiline comments */, then loop through all single-line comments and remove them, then loop through all multiline comments and remove them, or vice versa. But that's where I'm stuck. Neither order is sufficient, because:
// /*
An error is thrown because the /* is ignored
*/
/*
This doesn't affect things because of mysterious reasons
// */
and
/*
This throws an error because the second * / is unmatched
// */ */
What is the reason for this behavior? Is it also an artifact of the way compilers parse things? To be clear, I don't want to change the behavior of C++; I would just like to know the reasoning behind the second set of examples behaving the way they do.
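To make the problem concrete, here is a rough sketch of the two removal orders described above, using the two regular expressions from this question. The function names (strip_line_then_block, strip_block_then_line) and the sample inputs are made up for illustration, and the patterns are written as raw strings so Python does not reinterpret the backslashes.

```python
import re

# The two patterns from the question, compiled as raw strings.
LINE_COMMENT = re.compile(r"\/[\/]+((?![\n])[\s\S])*\r*\n")
BLOCK_COMMENT = re.compile(r"\/\*((?!\*\/)[\s\S])*\*\/")

def strip_line_then_block(src):
    """Remove every // match first, then every /* ... */ match."""
    return BLOCK_COMMENT.sub("", LINE_COMMENT.sub("\n", src))

def strip_block_then_line(src):
    """Remove every /* ... */ match first, then every // match."""
    return LINE_COMMENT.sub("\n", BLOCK_COMMENT.sub("", src))

# Case A: the // hides the /*, so a compiler still sees "code" and a stray */.
case_a = "// /*\ncode\n*/\n"
# Case B: the // sits inside a block comment, so the */ after it closes it.
case_b = "/*\ntext\n// */\nint x;\n"

a_wrong = strip_block_then_line(case_a)  # block pass swallows "code"
b_wrong = strip_line_then_block(case_b)  # line pass deletes the closing */

print(repr(a_wrong))
print(repr(b_wrong))
```

Each order gets one case right and the other wrong, which suggests the two kinds of comment cannot be removed in independent passes: which opener comes first in the text decides how everything after it is read.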
Edit:
So yes, to be more explicit, my question is why the following three (seemingly reasonable) ways of explaining this behavior don't work:
1. Simply ignore all characters on a line after //, regardless of whether they are /* or */, even if you are in a multiline comment.
2. Allow a /* or */ that follows a // to still have effect.
3. Both of the above.
I understand why nested comments aren't allowed, because they would require a stack and arbitrarily high amounts of memory. But these three cases would not.
Edit again:
If anyone is interested, here is some code to extract the comments of a C/C++ file in Python, following the correct commenting rules discussed here:
import re
commentScanner = re.Scanner([
(r"\/[\/]+((?![\n])[\s\S])*\r*(\n{1})?", lambda scanner, token: ("//", token)),
(r"\/\*((?!\*\/)[\s\S])*\*\/", lambda scanner, token: ("/* ... */", token)),
(r"[\s\S]", lambda scanner, token: None)
])
comments, remainder = commentScanner.scan("fds a45fsa//kjl fds4325lkjfa/*jfds/\nk\\lj\\/*4532jlfds5342a l/*a/*b/*c\n//fdsafa\n\r\n/*jfd//a*/fd// fs54fdsa3\r\r//\r/*\r\n2a\n\n\nois")
print(comments)
It's not inconsistent. The existing behaviour is both easy to specify and easy to implement, and your compiler is implementing it correctly. See [lex.comment] in the standard.
The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates with the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [ Note: The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment. — end note ]
As you can see, // can be used to comment out both /* and */. It's just that comments don't nest, so if the // is already inside a /*, then the // has no effect at all.
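The rule quoted above amounts to a single left-to-right scan: whichever comment opener is reached first wins, and everything up to its terminator is skipped without interpretation. Below is a minimal sketch of that scan in Python. The function name strip_comments is hypothetical, and the sketch assumes the input contains no string or character literals, raw strings, or backslash line splices, all of which a real lexer must also handle.

```python
def strip_comments(src):
    """Remove // and /* */ comments in one left-to-right pass.

    Assumption of this sketch: no string/character literals, raw strings
    or backslash-newline splices appear in src.
    """
    out = []
    i, n = 0, len(src)
    while i < n:
        if src.startswith("//", i):
            # Line comment: skip to the newline, keeping the newline itself.
            end = src.find("\n", i)
            i = n if end == -1 else end
        elif src.startswith("/*", i):
            # Block comment: skip to the first */, ignoring any // or /* inside.
            end = src.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated /* comment")
            out.append(" ")  # a comment is replaced by a single space
            i = end + 2
        else:
            out.append(src[i])
            i += 1
    return "".join(out)

# The three puzzling examples from the question:
print(repr(strip_comments("// /*\ncode\n*/\n")))          # '\ncode\n*/\n'
print(repr(strip_comments("/*\ntext\n// */\nint x;\n")))  # ' \nint x;\n'
print(repr(strip_comments("/*\ntext\n// */ */\n")))       # '  */\n'
```

In the first and third cases a stray */ survives the scan, which is what the compiler then reports as an error; in the second case the // is ordinary text inside the block comment, so the */ after it closes the comment normally.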