ANTLR3 String Literals and Disallowing Nested Comments


I've recently been tasked with writing an ANTLR3 grammar for a fictional language. Everything else seems fine, but I have a couple of minor issues I could use some help with:

1) Comments are between '/*' and '*/', and may not be nested. I know how to implement comments themselves ('/*' .* '*/'), but how would I go about disallowing their nesting?

2) String literals are defined as any sequence of characters (except for double quotes and new lines) in between a pair of double quotes. They can only be used in an output statement. I attempted to define this thus:

output : OUTPUT (STRINGLIT | IDENT) ;
STRINGLIT : '"' ~('\r' | '\n' | '"')* '"' ;

For some reason, however, the parser accepts

OUTPUT "Hello,
World!"

and tokenises it as "Hello, \nWorld. Where the exclamation mark or closing " went I have no idea. Something to do with whitespace maybe?

WHITESPACE : ( '\t' | ' ' | '\n' | '\r' | '\f' )+ { $channel = HIDDEN; } ;

Any advice would be much appreciated - thanks for your time! :)


Solution

    1. The rule you wrote already disallows nested comments. In ANTLR 3 the .* loop is non-greedy by default, so the token ends at the first */ it meets, even if several /* sequences appeared inside the comment. Anything after that first */ is lexed as ordinary input. To allow nested comments you would have to write a lexer rule that handles the nesting explicitly.

    2. STRINGLIT excludes '\r' and '\n', so it cannot match a string that is split across multiple lines. Without seeing the rest of your lexer rules, I cannot tell you how the sample will be tokenised, but it's clear from the STRINGLIT rule you gave that the sample input is not a valid string literal.
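To make point 1 concrete, here is a rough sketch in Python rather than ANTLR. It assumes a non-greedy regex is a fair stand-in for '/*' .* '*/', and the helper names (first_comment, nested_comment_end) are made up for illustration. It shows the non-greedy match stopping at the first */, and the kind of depth counting a nesting-aware rule would need instead.

```python
import re

# Rough regex analogue of the lexer rule '/*' .* '*/' with a non-greedy loop.
# This is an illustration only, not how ANTLR implements the rule.
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def first_comment(text):
    """Return the first comment matched non-greedily, or None."""
    m = COMMENT.search(text)
    return m.group(0) if m else None


def nested_comment_end(text, start=0):
    """Return the index just past a comment that may nest, counting depth.

    Returns -1 if no comment starts at `start` or it is never closed.
    """
    if not text.startswith("/*", start):
        return -1
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1          # entering one more level of comment
            i += 2
        elif text.startswith("*/", i):
            depth -= 1          # leaving one level
            i += 2
            if depth == 0:
                return i        # outermost comment is closed
        else:
            i += 1
    return -1


sample = "/* outer /* inner */ still outer */ x = 1;"
print(first_comment(sample))                # /* outer /* inner */
print(sample[:nested_comment_end(sample)])  # /* outer /* inner */ still outer */
```

With the non-greedy form, " still outer */ x = 1;" is left over for the other lexer rules, which is why the nested input does not pass as a single comment.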

    NOTE: Your input given in the original question was not clear, so I reformatted it in an attempt to show the exact input you were using. Can you verify that my edit properly represents the input?
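For point 2, a quick way to check the STRINGLIT rule on its own is to translate it into a regex. This is only a sketch in Python, assuming the character class below is equivalent to ~('\r' | '\n' | '"')* and that the input really contains a line break between "Hello," and "World!"; is_string_literal is a hypothetical helper, not part of ANTLR.

```python
import re

# Regex analogue of: STRINGLIT : '"' ~('\r' | '\n' | '"')* '"' ;
STRINGLIT = re.compile(r'"[^\r\n"]*"')


def is_string_literal(text):
    """True if the whole text is one string literal under the rule above."""
    return STRINGLIT.fullmatch(text) is not None


print(is_string_literal('"Hello, World!"'))   # True
print(is_string_literal('"Hello,\nWorld!"'))  # False: contains a newline
```

So the rule itself rejects the two-line input. If the parser still accepts it, the cause is more likely some other lexer rule picking up the pieces than STRINGLIT matching across the line break.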