
JFlex match nested comments as one token


In Mathematica, a comment starts with (* and ends with *), and comments can be nested. My current approach to scanning a comment with JFlex uses the following code:

%xstate IN_COMMENT

"(*"  { yypushstate(IN_COMMENT); return MathematicaElementTypes.COMMENT;}

<IN_COMMENT> {
  "(*"        {yypushstate(IN_COMMENT); return MathematicaElementTypes.COMMENT;}
  [^\*\)\(]*  {return MathematicaElementTypes.COMMENT;}
  "*)"        {yypopstate(); return MathematicaElementTypes.COMMENT;}
  [\*\)\(]    {return MathematicaElementTypes.COMMENT;}
  .           {return MathematicaElementTypes.BAD_CHARACTER;}
}

where the methods yypushstate and yypopstate are defined as

private final LinkedList<Integer> states = new LinkedList<>();

private void yypushstate(int state) {
    states.addFirst(yystate());
    yybegin(state);
}
private void yypopstate() {
    final int state = states.removeFirst();
    yybegin(state);
}

to give me the opportunity to track how many nested levels of comment I'm dealing with.

Unfortunately, this results in several COMMENT tokens for one comment, because I have to match nested comment starts and comment ends.

Question: Is it possible with JFlex to use its API with methods like yypushback or advance() etc. to return exactly one token over the whole comment range, even if comments are nested?


Solution

  • It seems the bounty was uncalled for as the solution is so simple that I just did not consider it. Let me explain. When scanning a simple nested comment

    (* (*..*) *)
    

    I have to track how many opening comment delimiters I have seen, so that on the last, outermost closing delimiter I can return the whole comment as one token.

    What I did not realise was that JFlex does not need to be told to advance to the next portion of the input when it matches something. After careful review I saw that this is explained here, but somewhat hidden in a section I had not paid attention to:

    Because we do not yet return a value to the parser, our scanner proceeds immediately.

    Therefore, a rule in flex file like this

    [^\(\*\)]+ { }
    

    matches every run of characters that cannot be part of a comment start or end and does nothing with it. Because the action returns no token, the scanner simply continues with the input that follows.

    This means that I can simply do the following. In the YYINITIAL state, I have a rule that matches the beginning of a comment but does nothing other than switch the lexer to the IN_COMMENT state. In particular, it does not return any token:

    {CommentStart}      { yypushstate(IN_COMMENT);}
    

    Now we are in the IN_COMMENT state, and there I do the same: I eat up all characters but never return a token. When I hit a new opening comment, I push the current state onto the stack but return nothing. Only when I hit the last closing comment do I know that I'm leaving the IN_COMMENT state, and this is the only point where I finally return the token. Let's look at the rules:

    <IN_COMMENT> {
      {CommentStart}  { yypushstate(IN_COMMENT); }
      [^\(\*\)]+      { }
      {CommentEnd}    { yypopstate();
                        if (yystate() != IN_COMMENT)
                          return MathematicaElementTypes.COMMENT_CONTENT;
                      }
      [\*\)\(]        { }
      .               { return MathematicaElementTypes.BAD_CHARACTER; }
    }
    

    That's it. Now, no matter how deep your comment is nested, you will always get one single token that contains the entire comment.
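To see the idea outside of JFlex, here is a minimal plain-Java sketch of what the rules above amount to. It is not generated lexer code: the class name NestedCommentSketch and the method commentEnd are hypothetical, and a simple depth counter stands in for the state stack. It assumes the comment delimiters are exactly (* and *).

```java
/**
 * Plain-Java sketch of the rule set above: consume "(*" and "*)" while
 * tracking the nesting depth, and report one span for the whole comment.
 * NestedCommentSketch and commentEnd are hypothetical names, not JFlex API.
 */
final class NestedCommentSketch {

    /**
     * Returns the offset just past the "*)" that closes the comment starting
     * at start, or -1 if the input ends while a comment is still open.
     */
    static int commentEnd(String text, int start) {
        if (!text.startsWith("(*", start)) {
            throw new IllegalArgumentException("no comment starts at offset " + start);
        }
        int depth = 0;
        int i = start;
        while (i < text.length()) {
            if (text.startsWith("(*", i)) {
                depth++;          // like yypushstate(IN_COMMENT): nothing is returned
                i += 2;
            } else if (text.startsWith("*)", i)) {
                depth--;          // like yypopstate()
                i += 2;
                if (depth == 0) {
                    return i;     // outermost close: the only place a token is reported
                }
            } else {
                i++;              // like the empty actions: consume and keep scanning
            }
        }
        return -1;                // end of input inside an open comment
    }

    public static void main(String[] args) {
        String source = "a (* (*..*) *) b";
        int start = source.indexOf("(*");
        int end = commentEnd(source, start);
        System.out.println(source.substring(start, end)); // prints "(* (*..*) *)"
    }
}
```

The -1 result corresponds to the case where the input ends before the outermost comment is closed, which the lexer has to handle separately.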

    Now, I'm embarrassed and I'm sorry for such a simple question.

    Final note

    If you are doing something like this, remember that you only return a token when you hit the matching closing delimiter. Therefore, you should definitely add a rule that catches the end of file. In IDEA the default behavior is to just return the comment token, so you need another line (useful or not, I want to end gracefully):

        <<EOF>>  { yyclearstack(); yybegin(YYINITIAL);
                   return MathematicaElementTypes.COMMENT;}
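
The yyclearstack helper used in this rule is not shown above. A minimal sketch of how it could sit next to yypushstate and yypopstate follows; the body of yyclearstack is an assumption, the class name StateStackSketch, the depth and onEofInComment methods, and the state constants are made up for illustration, and yybegin/yystate are stand-ins for the methods JFlex generates.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of the state-stack helpers with an added yyclearstack().
 * The yyclearstack() body is an assumption; yybegin/yystate are stand-ins
 * for the methods that JFlex generates in the real lexer class.
 */
final class StateStackSketch {
    static final int YYINITIAL = 0;
    static final int IN_COMMENT = 2; // illustrative values; JFlex assigns the real ones

    private final Deque<Integer> states = new ArrayDeque<>();
    private int current = YYINITIAL;

    int yystate() { return current; }

    void yybegin(int state) { current = state; }

    void yypushstate(int state) {
        states.push(yystate()); // remember where we came from
        yybegin(state);
    }

    void yypopstate() {
        yybegin(states.pop());  // return to the previously saved state
    }

    /** Assumed definition: forget every saved state. */
    void yyclearstack() {
        states.clear();
    }

    /** Number of saved states, i.e. the current nesting depth. */
    int depth() { return states.size(); }

    /** Mirrors the EOF action: reset everything, then hand back the comment token (here only its name). */
    String onEofInComment() {
        yyclearstack();
        yybegin(YYINITIAL);
        return "COMMENT";
    }
}
```

Clearing the stack before switching back to YYINITIAL means an unterminated comment leaves no stale states behind if the lexer instance is reused.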