I'm trying to write a compiler using ANTLR and Java. My problem is that I have a rule and I only want to use part of what it matches. Given a command such as 0: HALT 0,0,0, I want to ignore everything that follows it on the same line.
For example, in 0: HALT 0,0,0 blah blah blah, I want to ignore the blah blah blah.
My rule is:
rule returns [String value]
:
INTEGER':' ro=rocommand i1=INTEGER',' i2=INTEGER ',' i3=INTEGER rest {$value = $ro.text+" "+$i1.text+","+$i2.text+","+$i3.text; }
| INTEGER':' rm=rmcommand j1=INTEGER ',' j2=INTEGER '('j3=INTEGER')' rest {$value = $rm.text+" "+$j1.text+","+$j2.text+"("+$j3.text+")"; }
;
and the code I have is:
CharStream charStream = new ANTLRStringStream(strLine);
simulatorLexer lexer = new simulatorLexer(charStream);
TokenStream tokenStream = new CommonTokenStream(lexer);
simulatorParser parser = new simulatorParser(tokenStream);
System.out.println(parser.rule());
What I get is:
0: rule:IN 0,0,0
1: rule:LDC 1,1,0
line 1:15 no viable alternative at character 'r'
line 1:18 no viable alternative at character '='
line 1:15 no viable alternative at character 'i'
for the text:
0: rule:IN 0,0,0
1: rule:LDC 1,1,0 r1=0
So it should parse the first line completely, parse the second line up to the final 0, and then ignore r1=0. The values it returns are correct, but it also prints a number of errors that I want to get rid of. Please help me!
I'm posting the whole grammar so you can help me better. I just want to recognize the rule part.
program:
rule+
;
rocommand:
'HALT'|'IN'|'OUT'|'ADD'|'SUB'|'MUL'|'DIV'|'LDC'
;
rmcommand:
'LD'|'LDA'|'LDC'|'ST'|'JLT'|'JLE'|'JGE'|'JGT'|'JEQ'|'JNE'
;
rest:
~('\n'|'\r')* '\r'? ('\n'|EOF)
;
rule returns [String value]
:
INTEGER':' ro=rocommand i1=INTEGER',' i2=INTEGER ',' i3=INTEGER rest {$value = $ro.text+" "+$i1.text+","+$i2.text+","+$i3.text; }
| INTEGER':' rm=rmcommand j1=INTEGER ',' j2=INTEGER '('j3=INTEGER')' rest {$value = $rm.text+" "+$j1.text+","+$j2.text+"("+$j3.text+")"; }
;
WS : (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;};
INTEGER : '0'..'9'+;
IGNORELINE : '*' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;};
There are a couple of things wrong with the rule:
rest:
~('\n'|'\r')* '\r'? ('\n'|EOF)
;
Inside parser rules, ~ negates tokens, not characters. So ~('\n'|'\r') does not match a single character other than '\n' or '\r'. It matches any single token other than the tokens that matched \r or \n.
Also, since your lexer puts '\n' and '\r' on the hidden channel, these tokens will not be available in your parser. This means that the '\n' in the rest rule can never be matched.
In short: you can't "tell" your parser what the end of a line is, since these characters are discarded by your WS rule. This means you have no way to properly write such a rest parser rule.
For your input:
0: IN 0,0,0
1: LDC 1,1,0 r1=0
(note that I removed the 'rule:' prefixes)
the following tokens are produced by your lexer:
token[type=INTEGER text='0']
token[type=':' text=':']
token[type='IN' text='IN']
token[type=INTEGER text='0']
token[type=',' text=',']
token[type=INTEGER text='0']
token[type=',' text=',']
token[type=INTEGER text='0']
token[type=INTEGER text='1']
token[type=':' text=':']
token[type='LDC' text='LDC']
token[type=INTEGER text='1']
token[type=',' text=',']
token[type=INTEGER text='1']
token[type=',' text=',']
token[type=INTEGER text='0']
token[type=INTEGER text='1']
token[type=INTEGER text='0']
So these are the tokens available in your parser rules.
Note that the two characters '=' and 'r' cannot be matched by the lexer, as you can see from the errors:
line 2:13 no viable alternative at character 'r'
line 2:15 no viable alternative at character '='
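To make the token list and the two errors above easier to reproduce, here is a small hand-written Java approximation of what the lexer does with this input. It is only a sketch and not ANTLR-generated code: the class name LexerSketch and the method tokenize are hypothetical, the IGNORELINE rule is left out, and literals are simply tried longest-first, which is an assumption that happens to be good enough for this input.

```java
import java.util.ArrayList;
import java.util.List;

public class LexerSketch {
    // Literal tokens that the parser rules introduce, longest first so "LDC" wins over "LD".
    static final String[] LITERALS = {
        "HALT", "LDA", "LDC", "OUT", "ADD", "SUB", "MUL", "DIV", "JLT", "JLE",
        "JGE", "JGT", "JEQ", "JNE", "IN", "LD", "ST", ":", ",", "(", ")"
    };

    /** Returns the tokens the parser would see; unmatched characters are reported in errors. */
    static List<String> tokenize(String input, List<String> errors) {
        List<String> tokens = new ArrayList<>();
        int line = 1, col = 0, i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            // WS rule: whitespace, including line breaks, goes to the hidden channel.
            if (c == '\n') { line++; col = 0; i++; continue; }
            if (c == ' ' || c == '\r' || c == '\t' || c == '\u000C') { col++; i++; continue; }
            // INTEGER rule: '0'..'9'+
            if (c >= '0' && c <= '9') {
                int start = i;
                while (i < input.length() && input.charAt(i) >= '0' && input.charAt(i) <= '9') i++;
                tokens.add("INTEGER '" + input.substring(start, i) + "'");
                col += i - start;
                continue;
            }
            // Literal tokens such as 'IN', 'LDC', ':' and ','.
            String matched = null;
            for (String lit : LITERALS) {
                if (input.startsWith(lit, i)) { matched = lit; break; }
            }
            if (matched != null) {
                tokens.add("'" + matched + "'");
                i += matched.length();
                col += matched.length();
                continue;
            }
            // No lexer rule can start with this character: report it and skip it.
            errors.add("line " + line + ":" + col + " no viable alternative at character '" + c + "'");
            i++;
            col++;
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        List<String> tokens = tokenize("0: IN 0,0,0\n1: LDC 1,1,0 r1=0", errors);
        tokens.forEach(System.out::println);
        errors.forEach(System.out::println);
    }
}
```

For the two-line input, this sketch yields the same 18 visible tokens as listed above and records errors for 'r' at line 2:13 and '=' at line 2:15. No line-break token ever reaches the token list, which is why the parser has nothing to anchor the end of rest on.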
A possible solution would be to create a lexer rule that matches an integer and a colon:
START : INTEGER ':';
and let your rule start with this token:
rule
: START ro=rocommand i1=INTEGER ',' i2=INTEGER ',' i3=INTEGER rest ...
| ...
;
That way, your rest can match zero or more tokens other than that START token:
rest
: ~START*
;
And to capture the '=' and 'r' characters, create an ANY rule and put this rule at the end of your lexer rules:
ANY : . ; // match any char
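Assembled into one grammar, this first approach might look roughly like the sketch below. It is the grammar from the question with only START, rest and ANY changed. I have not run it through ANTLR, so treat it as a starting point rather than a verified grammar.

```
grammar simulator;

program
  : rule+
  ;

rule returns [String value]
  : START ro=rocommand i1=INTEGER ',' i2=INTEGER ',' i3=INTEGER rest
    {$value = $ro.text + " " + $i1.text + "," + $i2.text + "," + $i3.text;}
  | START rm=rmcommand j1=INTEGER ',' j2=INTEGER '(' j3=INTEGER ')' rest
    {$value = $rm.text + " " + $j1.text + "," + $j2.text + "(" + $j3.text + ")";}
  ;

rocommand
  : 'HALT' | 'IN' | 'OUT' | 'ADD' | 'SUB' | 'MUL' | 'DIV' | 'LDC'
  ;

rmcommand
  : 'LD' | 'LDA' | 'LDC' | 'ST' | 'JLT' | 'JLE' | 'JGE' | 'JGT' | 'JEQ' | 'JNE'
  ;

// everything up to, but not including, the next "INTEGER ':'" token
rest
  : ~START*
  ;

START      : INTEGER ':';
INTEGER    : '0'..'9'+;
WS         : (' ' | '\r' | '\t' | '\u000C' | '\n') {$channel=HIDDEN;};
IGNORELINE : '*' ~('\n' | '\r')* '\r'? '\n' {$channel=HIDDEN;};
ANY        : . ; // must stay last: catches characters no other rule matches
```

Because a command is now delimited by the next START token rather than by a line break, a stray "digits followed by a colon" inside the ignored text would end rest early. Whether that matters depends on what can appear after a command in your input.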
That way, the parser can build a parse tree for the input above in which the trailing r, 1, = and 0 tokens all end up under rest.
Another solution would be to create a LINE_BREAK token:
LINE_BREAK : '\r'? '\n' | '\r';
(and remove \r and \n from WS, of course!)
And do something like this:
rule
: INTEGER ':' ro=rocommand i1=INTEGER ',' i2=INTEGER ',' i3=INTEGER rest LINE_BREAK ...
| ...
;
rest
: ~LINE_BREAK*
;
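A fuller sketch of this second approach is shown below, again based on the grammar from the question and not run through ANTLR. Two details in it are my own assumptions rather than part of the suggestion above: the rule accepts EOF as an alternative to LINE_BREAK so that a last line without a trailing newline still parses, and the ANY rule from the first approach is kept so that characters such as 'r' and '=' become tokens that rest can consume.

```
grammar simulator;

program
  : rule+
  ;

rule returns [String value]
  : INTEGER ':' ro=rocommand i1=INTEGER ',' i2=INTEGER ',' i3=INTEGER rest (LINE_BREAK | EOF)
    {$value = $ro.text + " " + $i1.text + "," + $i2.text + "," + $i3.text;}
  | INTEGER ':' rm=rmcommand j1=INTEGER ',' j2=INTEGER '(' j3=INTEGER ')' rest (LINE_BREAK | EOF)
    {$value = $rm.text + " " + $j1.text + "," + $j2.text + "(" + $j3.text + ")";}
  ;

rocommand
  : 'HALT' | 'IN' | 'OUT' | 'ADD' | 'SUB' | 'MUL' | 'DIV' | 'LDC'
  ;

rmcommand
  : 'LD' | 'LDA' | 'LDC' | 'ST' | 'JLT' | 'JLE' | 'JGE' | 'JGT' | 'JEQ' | 'JNE'
  ;

// everything up to, but not including, the end of the line
rest
  : ~LINE_BREAK*
  ;

LINE_BREAK : '\r'? '\n' | '\r';
INTEGER    : '0'..'9'+;
WS         : (' ' | '\t' | '\u000C') {$channel=HIDDEN;}; // no '\r' or '\n' here any more
IGNORELINE : '*' ~('\n' | '\r')* '\r'? '\n' {$channel=HIDDEN;};
ANY        : . ; // must stay last: catches characters no other rule matches
```

As written, program does not allow empty lines between commands. If your input can contain them, the program rule would need to skip extra LINE_BREAK tokens as well.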