java parsing lexical-analysis javacc

Parsing block comments with javacc


I'm trying to write a JavaCC grammar to parse a file that contains multi-line comments. For example, the following are all valid:

/**/
/* */
/* This is a comment */
/* This
   is
   a
   multiline
   comment
*/

I would like the parsing to fail if there is a /* not closed by a */, or a closing */ without an opening /*.

I'm not trying to skip the comments, I want the comments available as tokens.

So far I have tried this approach, which works but does not fail on an unclosed /*:

options {
  STATIC = false;
}

PARSER_BEGIN(BlockComments)

package com.company;

public class BlockComments {}

PARSER_END(BlockComments)

TOKEN : { < START_BLOCK_COMMENT : "/*" >  : WITHIN_BLOCK_COMMENT }
<WITHIN_BLOCK_COMMENT> TOKEN: { < BLOCK_COMMENT: (~["*", "/"] | "*" ~["/"])+ > }
<WITHIN_BLOCK_COMMENT> TOKEN: { < END_BLOCK_COMMENT: "*/" > : DEFAULT }

SKIP : {
  "\n"
}

The other option I have tried is this, which has the same problem, with the slight difference that /* and */ are skipped instead of being read as tokens:

options {
  STATIC = false;
}

PARSER_BEGIN(BlockComments)

package com.company;

public class BlockComments {}

PARSER_END(BlockComments)

SKIP : { "/*" : WITHIN_BLOCK_COMMENT }
<WITHIN_BLOCK_COMMENT> TOKEN: { <BLOCK_COMMENT: (~["*", "/"] | "*" ~["/"])+ > }
<WITHIN_BLOCK_COMMENT> SKIP : { "*/" : DEFAULT }

SKIP : {
  "\n"
}

I also tried using MORE : { "/*" : WITHIN_BLOCK_COMMENT } in the second option. That makes parsing fail for an unclosed /*, but it also makes every BLOCK_COMMENT token start with /*, which I don't want.


Solution

  • I'm not sure what the rest of your file looks like, so I'll assume that a file is expected to be a sequence of comments preceded, followed, and separated by zero or more spaces and newlines.

    What I would do is this:

    TOKEN : { < START_BLOCK_COMMENT : "/*" >  : WITHIN_BLOCK_COMMENT }
    <WITHIN_BLOCK_COMMENT> TOKEN: { <CHAR_IN_COMMENT: ~[] > }
    <WITHIN_BLOCK_COMMENT> TOKEN: { < END_BLOCK_COMMENT: "*/" > : DEFAULT }
    
    SKIP : {
      "\n" | " " 
    }
    

    Now in the parser we have:

    void start() : {String s ; } {
        (
            s = comment()  {System.out.println(s); }
        )*
    }
    
    String comment() :
    {   Token t ;
        // Collects the comment text one character token at a time
        StringBuilder b = new StringBuilder() ;
    }
    {  <START_BLOCK_COMMENT>
       (
             t=<CHAR_IN_COMMENT>  {b.append( t.image ) ; }
       )*
       <END_BLOCK_COMMENT>
       {return b.toString() ; }
    }
    

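    Now you don't get a lexical error (a TokenMgrError) for a missing */, but you do get a ParseException: the parser reaches the end of the file while it is still expecting <END_BLOCK_COMMENT>. A stray */ outside a comment fails earlier, as a lexical error, because no token in the DEFAULT state starts with *.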
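
    To make the two failure modes concrete, here is a rough hand-written sketch in plain Java that imitates the two lexical states and the comment() production. It is not JavaCC output; the class name BlockCommentSketch, the parse method, and the use of IllegalStateException in place of TokenMgrError and ParseException are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration only; this is NOT code generated by JavaCC.
public class BlockCommentSketch {

    /** Returns the text of each comment, or throws if the input is malformed. */
    public static List<String> parse(String input) {
        List<String> comments = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            // DEFAULT state: spaces and newlines are skipped.
            if (c == ' ' || c == '\n') {
                i++;
                continue;
            }
            // DEFAULT state: "/*" is the only token, so anything else
            // (including a stray "*/") plays the role of a lexical error.
            if (!input.startsWith("/*", i)) {
                throw new IllegalStateException("Lexical error at offset " + i);
            }
            i += 2; // "/*" consumed: switch to WITHIN_BLOCK_COMMENT

            StringBuilder body = new StringBuilder();
            // WITHIN_BLOCK_COMMENT: "*/" is preferred over a single character
            // because the longest match wins; every other character is kept.
            while (i < input.length() && !input.startsWith("*/", i)) {
                body.append(input.charAt(i));
                i++;
            }
            // End of input while still inside a comment: the lexer is fine,
            // but the production still expects "*/", so this is a parse error.
            if (i >= input.length()) {
                throw new IllegalStateException(
                        "Parse error: reached end of input, expected \"*/\"");
            }
            i += 2; // "*/" consumed: back to DEFAULT
            comments.add(body.toString());
        }
        return comments;
    }
}
```

    Calling BlockCommentSketch.parse on input such as "/* a */" returns the comment bodies without the delimiters, while an unclosed "/*" or a leading "*/" throws. With the real grammar, the generated parser and token manager do this work for you.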