Tags: python, compiler-construction, antlr, antlr4

Python ANTLR: why isn't my code producing the expected outcome?


I need to create a compiler for a custom language called Decaf. I need a Python file called decaf-lexer.py that prints the list of tokens detected by my compiler for a given input text file. Here is my ANTLR grammar for the lexer:

grammar Decaf;

//********* LEXER ******************

fragment ALPHA : [a-zA-Z];
fragment DIGIT : [0-9];
ID : ALPHA( ALPHA | DIGIT)* ;
NUM: DIGIT(DIGIT)* ;
COMMENTS: '//' ~('\r' | '\n' )*  -> skip;
WS : (' ' | '\n')+  ->skip;

LROUND : '(';
RROUND : ')';
LCURLY : '{';
RCURLY : '}';
LSQUARE: '[' ;
RSQUARE : ']';
SEMI : ';';
CLASS: 'class';
BOOLEAN : 'boolean';
BREAK : 'break';
CALLOUT : 'callout';
CONTINUE : 'continue';
ELSE : 'else';
FALSE : 'false';
FOR : 'for';
IF : 'if';
INT : 'int';
RETURN : 'return';
TRUE : 'true';
VOID : 'void';
CHAR : ALPHA|DIGIT|' '| '#' | '$' | '&' | '.' | ':' | '?' | '@' | '\\' | '^' | '_' | '`'| '|' | '~' | '\t'| '\n' ;
COMMA: ',';
COMPARE: '==';
NEQUAL: '!=';
GREQUAL: '>=';
LSEQUAL: '<=';
LS: '<';
GR: '>';
AND: '&&';
OROR: '||';
EQUALS: '=';
PEQUAL: '+=';
MEQUAL: '-=';
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIVIDE: '/';
MOD: '%';
QUOTE: '"';
SQUOTE: '\'';
EXPLANATION: '!';


Here is the Python code:

import antlr4 as ant
from DecafLexer import DecafLexer

filein = open('example_01.decaf', 'r')
lexer = DecafLexer(ant.InputStream(filein.read()))

token = lexer.nextToken()
while token.type != -1:
    print(lexer.symbolicNames[token.type])
    token = lexer.nextToken()

The example file only contains:

(x + y)

The outcome is

LCURLY
COMMENTS
TIMES
COMMENTS
RCURLY

when it should be the following. Where am I going wrong?

LROUND
ID
PLUS
ID
RROUND

Solution

  • The array symbolicNames contains the names of the named lexer rules you defined, in the order in which you defined them. However, it does not contain the lexer rules that are implicitly defined for literals used directly in parser rules. Since those implicit tokens are assigned type numbers that come before those of the named rules, you cannot use token.type as an index into symbolicNames if your grammar uses any implicit lexer rules.

    Instead, you should use ruleNames, which does include the implicit tokens. For any token with a proper name, lexer.ruleNames[token.type] will correctly return that name, and for any token created from a string literal it will return a placeholder name like T__0.
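    The index shift described above can be sketched without running ANTLR at all. The tables below are hypothetical (the real generated tables also contain placeholder entries such as <INVALID>, so treat this purely as an illustration), but they show why a token type that lines up with ruleNames overshoots symbolicNames once implicit tokens exist:

    ```python
    # Hypothetical, simplified name tables for a grammar whose parser rules
    # use the literals '(' and ')' directly, so ANTLR creates two implicit
    # token types (T__0, T__1) ahead of the named rules. Illustration of
    # the index shift only, not the exact generated layout.
    ruleNames = ["T__0", "T__1", "ID", "PLUS"]  # implicit tokens included
    symbolicNames = ["ID", "PLUS"]              # named rules only

    # Token types are assigned in ruleNames order, starting at 1:
    id_type = ruleNames.index("ID") + 1  # -> 3

    # Indexing ruleNames recovers the right name...
    print(ruleNames[id_type - 1])        # ID

    # ...but the same type no longer lines up with symbolicNames, because
    # the implicit tokens are missing from it:
    print(symbolicNames[id_type - 1] if id_type - 1 < len(symbolicNames)
          else "<out of range>")         # <out of range>
    ```

    With that in mind, the suggested fix for the loop in the question is a one-line change: print lexer.ruleNames[token.type] instead of lexer.symbolicNames[token.type].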