Tags: parsing, antlr, language-design, antlr4, lexer

Recognize multiple line comments within a single line with ANTLR4


I want to parse PostScript code with ANTLR4. My grammar is finished, but one particular language extension (introduced by someone else) is proving hard to recognize.

A short example:

1: % This is a line comment
2: % The next line just pushes the value 10 onto the stack
3: 10
4: 
5: %?description This is the special line-comment in question
6: /procedure {
7:   /var1 30 def %This just creates a variable
8:   /var2 10 def %?description A description associated with var2 %?default 20
9:   /var3 (a string value) def %?description I am even allowed to use % signs %?default (another value)
10: }

Recognizing line comments, such as those in lines 1, 2 and 7, can be done with the lexer rules

LINE_COMMENT: '%' .*? NEWLINE;
NEWLINE: '\r'? '\n';

which simply match everything after a % until the end of the line.

The problem is with the special line comments that start with something like %?description or %?default. They should be recognized as well, but unlike LINE_COMMENT, several of them can appear on a single line (as in lines 8 and 9). Line 8, for example, contains two special comments: %?description A description associated with var2 and %?default 20.

Think of it as something like this (although this won't work):

SPECIAL_COMMENT: '%?' .*? (SPECIAL_COMMENT|NEWLINE);

Now comes the really tricky part: You should be allowed to put arbitrary text after %?description including % while still being able to split the individual comments.

So in short, the issue can be reduced to splitting a line of the form

(%?<keyword> <content with % allowed in it>)+ NEWLINE

e.g.

%?description descr. with % in in %?default (my default value for 100%) %?rest more

into

1.) %?description descr. with % in in 
2.) %?default (my default value for 100%)
3.) %?rest more

Any ideas on how to formulate lexer or parser rules to achieve this?


Solution

  • Given those rules, I think you'll have to use a predicate in the lexer to check the input stream for occurrences of %?. You'll also have to make sure a normal comment starts with a % that is not followed by a ? (a % directly followed by a line break still counts as an empty normal comment).

    Given the grammar:

    grammar T;
    
    @lexer::members {
      boolean ahead(String text) {
        for (int i = 0; i < text.length(); i++) {
          if (text.charAt(i) != _input.LA(i + 1)) {
            return false;
          }
        }
        return true;
      }
    }
    
    parse
     : token* EOF
     ;
    
    token
     : t=SPECIAL_COMMENT {System.out.println("special : " + $t.getText());}
     | t=COMMENT         {System.out.println("normal  : " + $t.getText());}
     ;
    
    SPECIAL_COMMENT
     : '%?' ( {!ahead("%?")}? ~[\r\n] )*
     ;
    
    COMMENT
     : '%' ( ~[?\r\n] ~[\r\n]* )?
     ;
    
    SPACES
     : [ \t\r\n]+ -> skip
     ;
    

    which can be tested as follows:

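    import org.antlr.v4.runtime.CharStreams;
    import org.antlr.v4.runtime.CommonTokenStream;

    String source = "% normal comment\n" +
        "%?description I am even allowed to use % signs %?default (another value)\n" +
        "% another normal comment (without a line break!)";
    // CharStreams.fromString requires ANTLR 4.7+; on older runtimes use new ANTLRInputStream(source)
    TLexer lexer = new TLexer(CharStreams.fromString(source));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();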
    

    and will print the following:

    normal  : % normal comment
    special : %?description I am even allowed to use % signs 
    special : %?default (another value)
    normal  : % another normal comment (without a line break!)
    

    The part ( {!ahead("%?")}? ~[\r\n] )* can be read as follows: if there's no "%?" ahead, match any char other than \r and \n, and do this zero or more times.
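
To see what that loop does without generating a lexer, here is a minimal plain-Java sketch of the same scan. It is only an illustration, not ANTLR code: the class name SpecialCommentSplitter and the method split are made up for this example, and ahead here takes the line and a position instead of reading from _input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR: mimics the SPECIAL_COMMENT rule by hand.
class SpecialCommentSplitter {

    // Same idea as the lexer's ahead(...): does `text` occur at position `pos`?
    static boolean ahead(String line, int pos, String text) {
        return line.startsWith(text, pos);
    }

    // Returns every "%?..." piece found in the input, in order.
    static List<String> split(String line) {
        List<String> pieces = new ArrayList<>();
        int pos = 0;
        while (pos < line.length()) {
            if (!ahead(line, pos, "%?")) {
                pos++; // not at the start of a special comment, keep looking
                continue;
            }
            int start = pos;
            pos += 2; // consume the leading "%?"
            // Equivalent of ( {!ahead("%?")}? ~[\r\n] )*
            while (pos < line.length()
                    && !ahead(line, pos, "%?")
                    && line.charAt(pos) != '\r'
                    && line.charAt(pos) != '\n') {
                pos++;
            }
            pieces.add(line.substring(start, pos));
        }
        return pieces;
    }

    public static void main(String[] args) {
        String line = "%?description descr. with % in in "
                + "%?default (my default value for 100%) %?rest more";
        for (String piece : split(line)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

For the example line from the question this yields three pieces: "%?description descr. with % in in ", "%?default (my default value for 100%) " and "%?rest more". As with the lexer rule, the whitespace before the next %? stays attached to the preceding piece, so trim it afterwards if that matters.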