I need to create a compiler for a custom language called Decaf. I need a Python file called decaf-lexer.py that prints the list of tokens detected by my compiler for a given input text file. Here is my ANTLR grammar for the lexer:
grammar Decaf;
//********* LEXER ******************
fragment ALPHA : [a-zA-Z];
fragment DIGIT : [0-9];
ID : ALPHA( ALPHA | DIGIT)* ;
NUM: DIGIT(DIGIT)* ;
COMMENTS: '//' ~('\r' | '\n' )* -> skip;
WS : (' ' | '\n')+ ->skip;
LROUND : '(';
RROUND : ')';
LCURLY : '{';
RCURLY : '}';
LSQUARE: '[' ;
RSQUARE : ']';
SEMI : ';';
CLASS: 'class';
BOOLEAN : 'boolean';
BREAK : 'break';
CALLOUT : 'callout';
CONTINUE : 'continue';
ELSE : 'else';
FALSE : 'false';
FOR : 'for';
IF : 'if';
INT : 'int';
RETURN : 'return';
TRUE : 'true';
VOID : 'void';
CHAR : ALPHA|DIGIT|' '| '#' | '$' | '&' | '.' | ':' | '?' | '@' | '\\' | '^' | '_' | '`'| '|' | '~' | '\t'| '\n' ;
COMMA: ',';
COMPARE: '==';
NEQUAL: '!=';
GREQUAL: '>=';
LSEQUAL: '<=';
LS: '<';
GR: '>';
AND: '&&';
OROR: '||';
EQUALS: '=';
PEQUAL: '+=';
MEQUAL: '-=';
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIVIDE: '/';
MOD: '%';
QUOTE: '"';
SQUOTE: '\'';
EXPLANATION: '!';
Here is the Python code:
import antlr4 as ant
from DecafLexer import DecafLexer
filein = open('example_01.decaf', 'r')
lexer = DecafLexer(ant.InputStream(filein.read()))
token = lexer.nextToken()
while token.type != -1:
    print(lexer.symbolicNames[token.type])
    token = lexer.nextToken()
The example file only contains:
(x + y)
The outcome is:
LCURLY
COMMENTS
TIMES
COMMENTS
RCURLY
when it should be this. Where am I going wrong?
LROUND
ID
PLUS
ID
RROUND
The array symbolicNames contains the names of the named lexer rules you defined, in the order in which you defined them. However, it does not contain the lexer rules that were implicitly defined for the literals you use in your parser rules. Since those implicit tokens get type numbers that come before those of the named rules, you cannot use token.type as an index into symbolicNames if your grammar uses any implicit lexer rules.
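For illustration, here is what the generated arrays could plausibly look like if the full grammar has two implicit literal tokens (the shifted output in the question, where ID prints as COMMENTS and PLUS prints as TIMES, is consistent with exactly two). The parser rules are not shown, so the T__ entries below are assumptions, not the actual generated contents:

# Hypothetical token types assigned by ANTLR: T__0 = 1, T__1 = 2, ID = 3, ...
symbolicNames = ['<INVALID>', 'ID', 'NUM', 'COMMENTS', 'WS', 'LROUND', ...]  # named rules only
ruleNames = ['T__0', 'T__1', 'ID', 'NUM', 'COMMENTS', 'WS', ...]  # implicit tokens included

# An ID token then has type 3, and symbolicNames[3] is 'COMMENTS', which is
# exactly the kind of shift visible in the question's output.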
Instead, you should use ruleNames, which does include the implicit tokens. So for any token with a proper name, lexer.ruleNames[token.type] will correctly return that name, and for any token created from a string literal it will return a name like T__0.
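Applied to the driver in the question, a minimal sketch of the fix is to swap the lookup over to ruleNames; the file name and loop are kept from the question:

import antlr4 as ant
from DecafLexer import DecafLexer

filein = open('example_01.decaf', 'r')
lexer = DecafLexer(ant.InputStream(filein.read()))

token = lexer.nextToken()
while token.type != ant.Token.EOF:  # Token.EOF is -1
    # ruleNames includes the implicit T__n tokens, so its indices stay
    # aligned with token.type (per the explanation above)
    print(lexer.ruleNames[token.type])
    token = lexer.nextToken()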