C comment removal with JavaCC

I know how to skip comments using SKIP declarations, but what I need is to take a C source file and output the same source without the comments.

So I declared a token <GENERIC_TEXT: (~[])+ > that gets copied to the output, but the comments aren't skipped. I suspect this token consumes all the input by itself.

Can someone help me, please?

Thank you

Solution

Don't use (~[])+: it matches one or more of any character, and because the lexer always takes the longest match, it will gobble up all your input in a single token. That is probably why you didn't see the comments being skipped.

In your default lexer state, switch to a different state when you encounter "/*" (the beginning of a multi-line comment). In this other state, either match "*/" (and switch back to the default lexer state), or match any single char ~[] (not (~[])+!).

A quick demo:

CommentStripParser.jj

PARSER_BEGIN(CommentStripParser)

public class CommentStripParser {
  public static void main(String[] args) throws Exception {
    java.io.FileInputStream file = new java.io.FileInputStream(new java.io.File(args[0]));
    CommentStripParser parser = new CommentStripParser(file);
    parser.parse();
  }
}

PARSER_END(CommentStripParser)

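// Any single character that is not part of a comment is passed through.
TOKEN :
{
  < OTHER : ~[] >
}

// In the default state, skip single-line comments and enter
// ML_COMMENT_STATE at the start of a multi-line comment.
SKIP :
{
  < "//" (~["\r", "\n"])* >
| < "/*" > : ML_COMMENT_STATE
}

// Inside a multi-line comment, skip one character at a time
// until the closing "*/" switches back to DEFAULT.
<ML_COMMENT_STATE> SKIP :
{
  < "*/" > : DEFAULT
| < ~[] >
}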

void parse() :
{
  Token t;
}
{
  ( t=<OTHER> {System.out.print(t.image);} )* <EOF>
}

Given the test file:

Test.java

/*
 * comments
 */
class Test {
  // more comments
  int foo() {
    return 42;
  }
}

Run the demo like this (assuming you have the files CommentStripParser.jj, Test.java and the JAR javacc.jar in the same directory):

java -cp javacc.jar javacc CommentStripParser.jj 
javac -cp . *.java
java -cp . CommentStripParser Test.java

This prints the following to your console:

class Test {

  int foo() {
    return 42;
  }
}

(no comments anymore)

Note that you will still need to account for string literals that might look like this:

"the following: /*, is not the start of a comment"

and char literals:

'"' // not the start of a string literal!
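
One way to picture that extra handling is a hand-written scanner with the same states as the grammar above, plus two more for literals. This is only a sketch in plain Java, not JavaCC output: the class name CommentStripSketch and the method strip are made up for illustration, and it assumes that an escape inside a literal is a backslash followed by one character. It does not handle backslash-newline continuations or unterminated literals.

```java
final class CommentStripSketch {

  // Scanner states, loosely mirroring the lexer states of the grammar,
  // with STRING and CHAR added so that literals are copied verbatim.
  private enum State { DEFAULT, LINE_COMMENT, BLOCK_COMMENT, STRING, CHAR }

  // Returns src with // and /* */ comments removed (hypothetical helper).
  static String strip(String src) {
    StringBuilder out = new StringBuilder();
    State state = State.DEFAULT;
    int i = 0;
    while (i < src.length()) {
      char c = src.charAt(i);
      char next = (i + 1 < src.length()) ? src.charAt(i + 1) : '\0';
      switch (state) {
        case DEFAULT:
          if (c == '/' && next == '*') { state = State.BLOCK_COMMENT; i += 2; continue; }
          if (c == '/' && next == '/') { state = State.LINE_COMMENT; i += 2; continue; }
          if (c == '"') state = State.STRING;
          else if (c == '\'') state = State.CHAR;
          out.append(c);
          break;
        case LINE_COMMENT:
          // The line terminator is not part of the comment, so keep it.
          if (c == '\r' || c == '\n') { state = State.DEFAULT; out.append(c); }
          break;
        case BLOCK_COMMENT:
          if (c == '*' && next == '/') { state = State.DEFAULT; i += 2; continue; }
          break;
        case STRING:
        case CHAR:
          out.append(c);
          // Copy an escaped character as-is so \" or \' does not end the literal.
          if (c == '\\' && i + 1 < src.length()) { out.append(next); i += 2; continue; }
          if ((state == State.STRING && c == '"') || (state == State.CHAR && c == '\'')) {
            state = State.DEFAULT;
          }
          break;
      }
      i++;
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String sample = "char *s = \"/* kept */\"; /* dropped */\n";
    System.out.print(strip(sample));
  }
}
```

In the grammar itself, the analogous change would probably be to add string and char literal tokens to the TOKEN block alongside OTHER and print them too, so that the longest match picks the whole literal before "/*" or "//" inside it can start a comment.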