I've recently been tasked with writing an ANTLR3 grammar for a fictional language. Everything else seems fine, but I have a couple of minor issues I could use some help with:
1) Comments are between '/*' and '*/', and may not be nested. I know how to implement comments themselves ('/*' .* '*/'), but how would I go about disallowing their nesting?
2) String literals are defined as any sequence of characters (except for double quotes and new lines) in between a pair of double quotes. They can only be used in an output statement. I attempted to define this thus:
output : OUTPUT (STRINGLIT | IDENT) ;
STRINGLIT : '"' ~('\r' | '\n' | '"')* '"' ;
For some reason, however, the parser accepts
OUTPUT "Hello,
World!"
and tokenises it as "Hello, \nWorld. Where the exclamation mark or closing " went I have no idea. Something to do with whitespace maybe?
WHITESPACE : ( '\t' | ' ' | '\n' | '\r' | '\f' )+ { $channel = HIDDEN; } ;
Any advice would be much appreciated - thanks for your time! :)
The form you wrote already disallows nested comments. The token will stop at the first instance of */, even if multiple /* sequences appeared in the comment. To allow nested comments you have to write a lexer rule to specifically treat the nesting.
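As a rough illustration of both behaviours, here is a small Python sketch. It is only an analogy and not ANTLR itself: the non-greedy regular expression stands in for the '/*' .* '*/' rule, and `nested_comment_end` is a hypothetical helper showing what "specifically treating the nesting" could look like with a depth counter. The sample input is made up.

```python
import re

# Rough regex analogue of the lexer rule '/*' .* '*/' (non-greedy match).
# Assumption: this only mimics the matching behaviour; it is not ANTLR.
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

source = "/* outer /* inner */ still outer? */"
match = COMMENT.match(source)

# The match ends at the first '*/', so the inner '/*' is plain comment text.
comment_text = match.group(0)
remainder = source[match.end():]


def nested_comment_end(text, start=0):
    """Hypothetical helper: return the index just past a comment that may
    contain nested comments, or -1 if it does not start or never closes."""
    if not text.startswith("/*", start):
        return -1
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1  # entering one more level of comment
            i += 2
        elif text.startswith("*/", i):
            depth -= 1  # leaving one level of comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return -1  # ran out of input before the comment closed


print(repr(comment_text))
print(repr(remainder))
print(nested_comment_end(source))
```

With the non-nesting form, everything after the first '*/' is left over for the other lexer rules to deal with, whereas the depth-counting version consumes the whole input as one comment.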
The problem here is that STRINGLIT does not allow a string to be split across multiple lines. Without seeing the rest of your lexer rules, I cannot tell you how this input will be tokenized, but it's clear from the STRINGLIT rule you gave that the sample input is not a valid string.
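A quick way to convince yourself of this is to translate the STRINGLIT rule into an equivalent regular expression and try both forms of the input. The Python sketch below does that. The regex is my own translation of the rule, not something ANTLR generates, and the variable names are made up for the example.

```python
import re

# Assumed regex translation of:
#   STRINGLIT : '"' ~('\r' | '\n' | '"')* '"' ;
STRINGLIT = re.compile(r'"[^\r\n"]*"')

single_line = '"Hello, World!"'
multi_line = '"Hello,\nWorld!"'

# A literal on one line matches the rule in full.
single_ok = STRINGLIT.fullmatch(single_line) is not None

# The newline is excluded by the negated set, so the rule cannot reach the
# closing quote and the multi-line input is not a STRINGLIT.
multi_ok = STRINGLIT.fullmatch(multi_line) is not None

print(single_ok, multi_ok)
```

Since the multi-line input cannot be a single STRINGLIT, whatever the parser ends up seeing must come from other lexer rules or from error recovery, which is why the rest of the grammar matters here.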
NOTE: Your input given in the original question was not clear, so I reformatted it in an attempt to show the exact input you were using. Can you verify that my edit properly represents the input?