
C comment removal with JavaCC


I know how to skip comments using SKIP declarations, but what I need is to take a C source file and output the same source without the comments.

So I declared a token <GENERIC_TEXT: (~[])+ > that gets copied to the output, but then the comments aren't skipped. I suspect this token consumes all the input by itself.

Can someone help me, please?

Thank you


Solution

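  • Don't use (~[])+: it will gobble up all your input. JavaCC's lexer prefers the longest match, and (~[])+ always matches everything that is left, so your comment rules never get a chance. That is probably why you didn't see tokens being skipped.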

    In your default lexer state, switch to a different state when you encounter "/*" (the beginning of a multi-line comment). In that state, either match "*/" (and switch back to the default lexer state) or match any single char ~[] (not (~[])+!).

    A quick demo:

    CommentStripParser.jj

    PARSER_BEGIN(CommentStripParser)
    
    public class CommentStripParser {
      public static void main(String[] args) throws Exception {
        java.io.FileInputStream file = new java.io.FileInputStream(new java.io.File(args[0]));
        CommentStripParser parser = new CommentStripParser(file);
        parser.parse();
      }
    }
    
    PARSER_END(CommentStripParser)
    
    TOKEN :
    {
      < OTHER : ~[] >
    }
    
    SKIP :
    {
      < "//" (~["\r", "\n"])* >
    | < "/*" > : ML_COMMENT_STATE
    }
    
    <ML_COMMENT_STATE> SKIP :
    {
      < "*/" > : DEFAULT
    | < ~[] >   
    }
    
    void parse() :
    {
      Token t;
    }
    {
      ( t=<OTHER> {System.out.print(t.image);} )* <EOF>
    }
    

    Given the test file:

    Test.java

    /*
     * comments
     */
    class Test {
      // more comments
      int foo() {
        return 42;
      }
    }
    

    Run the demo like this (assuming you have the files CommentStripParser.jj, Test.java and the JAR javacc.jar in the same directory):

    java -cp javacc.jar javacc CommentStripParser.jj 
    javac -cp . *.java
    java -cp . CommentStripParser Test.java

    The following will be printed to your console:

    class Test {
    
      int foo() {
        return 42;
      }
    }
    

    (no comments anymore)

    Note that you will still need to account for string literals that might look like this:

    "the following: /*, is not the start of a comment"
    

    and char literals:

    '"' // not the start of a string literal!
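
To make the two caveats above concrete, here is a minimal hand-written sketch in plain Java of the same state idea, extended with a string state and a char state. It is not JavaCC output and not part of the demo grammar; the class name CommentStripSketch and the method strip are made up for illustration. It assumes simple C-like input and ignores corner cases such as backslash line continuations and trigraphs.

```java
/** Hypothetical helper (not generated by JavaCC): strips comments by hand. */
final class CommentStripSketch {

  // DEFAULT and BLOCK_COMMENT mirror the two lexer states of the demo grammar;
  // the other states are the additions needed for "//" comments and literals.
  private enum State { DEFAULT, BLOCK_COMMENT, LINE_COMMENT, STRING, CHAR }

  static String strip(String src) {
    StringBuilder out = new StringBuilder();
    State state = State.DEFAULT;
    int i = 0;
    while (i < src.length()) {
      char c = src.charAt(i);
      char next = (i + 1 < src.length()) ? src.charAt(i + 1) : '\0';
      switch (state) {
        case DEFAULT:
          if (c == '/' && next == '*') { state = State.BLOCK_COMMENT; i += 2; continue; }
          if (c == '/' && next == '/') { state = State.LINE_COMMENT; i += 2; continue; }
          if (c == '"') state = State.STRING;        // comment markers inside are plain text
          else if (c == '\'') state = State.CHAR;    // a quote inside is not a string start
          out.append(c);
          break;
        case BLOCK_COMMENT:
          if (c == '*' && next == '/') { state = State.DEFAULT; i += 2; continue; }
          break;                                     // drop the comment character
        case LINE_COMMENT:
          // Like the demo's SKIP rule, the line break itself is kept.
          if (c == '\r' || c == '\n') { state = State.DEFAULT; out.append(c); }
          break;
        case STRING:
        case CHAR:
          out.append(c);
          if (c == '\\' && i + 1 < src.length()) {   // copy an escaped char verbatim
            out.append(next);
            i += 2;
            continue;
          }
          if ((state == State.STRING && c == '"') || (state == State.CHAR && c == '\'')) {
            state = State.DEFAULT;                   // closing quote ends the literal
          }
          break;
      }
      i++;
    }
    return out.toString();
  }
}
```

In a JavaCC grammar the equivalent step would be to add string-literal and char-literal tokens alongside OTHER, so that the longest-match rule consumes a whole literal before "/*" or "//" inside it can be seen.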