In Mathematica, a comment starts with (* and ends with *), and comments can be nested. My current approach to scanning a comment with JFlex contains the following code:
%xstate IN_COMMENT
"(*" { yypushstate(IN_COMMENT); return MathematicaElementTypes.COMMENT;}
<IN_COMMENT> {
"(*" {yypushstate(IN_COMMENT); return MathematicaElementTypes.COMMENT;}
[^\*\)\(]* {return MathematicaElementTypes.COMMENT;}
"*)" {yypopstate(); return MathematicaElementTypes.COMMENT;}
[\*\)\(] {return MathematicaElementTypes.COMMENT;}
. {return MathematicaElementTypes.BAD_CHARACTER;}
}
where the methods yypushstate and yypopstate are defined as
private final LinkedList<Integer> states = new LinkedList<>();
private void yypushstate(int state) {
states.addFirst(yystate());
yybegin(state);
}
private void yypopstate() {
final int state = states.removeFirst();
yybegin(state);
}
to give me the opportunity to track how many nested levels of comment I'm dealing with.
Unfortunately, this results in several COMMENT tokens for one comment, because I have to match nested comment starts and comment ends.
Question: Is it possible with JFlex to use its API, with methods like yypushback or advance(), to return exactly one token over the whole comment range, even if comments are nested?
It seems the bounty was uncalled for, as the solution is so simple that I just did not consider it. Let me explain. When scanning a simple nested comment
(* (*..*) *)
I have to track how many opening comment tokens I see, so that on the last real closing comment I can finally return the whole comment as one token.
What I did not realise was that JFlex does not need to be told to advance to the next portion when it matches something. After careful review I saw that this is explained in the documentation, but somewhat hidden in a section I didn't care for:
Because we do not yet return a value to the parser, our scanner proceeds immediately.
Therefore, a rule in a flex file like this
[^\(\*\)]+ { }
reads all characters except those that could be part of a comment start or end. Its action does nothing, and because it returns no token, the scanner simply continues with the next match.
This means that I can simply do the following. In the YYINITIAL state, I have a rule that matches the beginning of a comment but does nothing other than switch the lexer to the IN_COMMENT state. In particular, it does not return any token:
{CommentStart} { yypushstate(IN_COMMENT);}
Now we are in the IN_COMMENT state, and there I do the same: I eat up all characters but never return a token. When I hit a new opening comment, I push the state onto the stack but do nothing else. Only when I hit the last closing comment do I know that I'm leaving the IN_COMMENT state, and this is the only point where I finally return the token. Let's look at the rules:
<IN_COMMENT> {
{CommentStart} { yypushstate(IN_COMMENT);}
[^\(\*\)]+ { }
{CommentEnd} { yypopstate();
if(yystate() != IN_COMMENT)
return MathematicaElementTypes.COMMENT_CONTENT;
}
[\*\)\(] { }
. { return MathematicaElementTypes.BAD_CHARACTER; }
}
That's it. Now, no matter how deep your comment is nested, you will always get one single token that contains the entire comment.
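To make the idea easier to check outside of a generated lexer, here is a minimal hand-written sketch of the same logic in plain Java. It is not JFlex code and not part of my plugin: the class NestedCommentScanner and the method scanComment are hypothetical names, and a simple depth counter stands in for the state stack. It consumes everything silently and reports a result only when the outermost comment closes.

```java
/** Hand-written stand-in for the IN_COMMENT rules (hypothetical helper, not JFlex output). */
final class NestedCommentScanner {

    /**
     * Returns the offset just past the comment that starts at {@code start},
     * or -1 if no comment starts there. An unterminated comment runs to the
     * end of the input.
     */
    static int scanComment(String text, int start) {
        if (!text.startsWith("(*", start)) {
            return -1;
        }
        int depth = 0;
        int i = start;
        while (i < text.length()) {
            if (text.startsWith("(*", i)) {
                depth++;          // like yypushstate(IN_COMMENT): one level deeper
                i += 2;
            } else if (text.startsWith("*)", i)) {
                depth--;          // like yypopstate(): leave one level
                i += 2;
                if (depth == 0) {
                    return i;     // outermost comment closed: the only "return"
                }
            } else {
                i++;              // like the empty-action rules: consume, emit nothing
            }
        }
        return text.length();     // end of input reached inside a comment
    }
}
```

For the example from above, NestedCommentScanner.scanComment("(* (*..*) *)", 0) yields 12, so the whole nested comment is covered by one range.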
Now, I'm embarrassed and I'm sorry for such a simple question.
If you are doing something like this, you have to remember that you only return a token when you hit the correct closing "character". Therefore, you should definitely add a rule that catches the end of file. In IDEA, the default behavior is to just return the comment token, so you need another line (useful or not, I want to end gracefully):
<<EOF>> { yyclearstack(); yybegin(YYINITIAL);
return MathematicaElementTypes.COMMENT;}
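The helper yyclearstack is not shown above. A plausible definition, assuming it only empties the state stack, could look like the following sketch. The class LexerStateStack is a hypothetical stand-in for the generated lexer so that the snippet compiles on its own; the numeric values of YYINITIAL and IN_COMMENT are made up, and yystate and yybegin merely imitate the methods JFlex generates.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Stand-in for the generated lexer's state handling (hypothetical, for illustration only). */
final class LexerStateStack {
    static final int YYINITIAL = 0;   // assumed value
    static final int IN_COMMENT = 2;  // assumed value

    private final Deque<Integer> states = new ArrayDeque<>();
    private int current = YYINITIAL;

    int yystate() { return current; }

    void yybegin(int state) { current = state; }

    /** Remember the current state, then enter the new one. */
    void yypushstate(int state) {
        states.push(current);
        yybegin(state);
    }

    /** Return to the most recently remembered state. */
    void yypopstate() {
        yybegin(states.pop());
    }

    /** Assumed behavior of yyclearstack: forget all remembered states. */
    void yyclearstack() {
        states.clear();
    }

    /** Number of remembered states, i.e. the current nesting depth. */
    int depth() { return states.size(); }
}
```

With this in place, the end-of-file rule leaves the lexer in a clean state: after yyclearstack() and yybegin(YYINITIAL), no stale IN_COMMENT entries remain for the next run.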