Tags: antlr, antlr4, context-free-grammar

Square brackets not recognized as tokens in ANTLR


I am creating a programming language for my semester project. We chose ANTLR as our compiler-compiler, and we have now run into a problem. When we specify the grammar for array declarations, ANTLR does not seem to recognize square brackets as tokens. For example, the following line of code:

string[] names = { "Bob", "Hans" }

will produce the error

extraneous input 'string[]' expecting {'end', 'num', 'bool', 'string', 'block', 'item', 'coords', 'break', 'for', 'while', 'until', 'switch', 'if', IDENTIFIER}

when the grammar for declarations is specified as follows:

dcl
    : 'num' IDENTIFIER '=' (NUM | IDENTIFIER | accessing)
    | 'bool' IDENTIFIER '=' (BOOL | IDENTIFIER | accessing)
    | 'string' '[' ']' IDENTIFIER '=' ('{' str_arr_items '}' | IDENTIFIER)
    | 'string' IDENTIFIER '=' (STR | IDENTIFIER | accessing)
    | 'block' IDENTIFIER '=' (ITEM_ID | IDENTIFIER | accessing)
    | 'item' IDENTIFIER '=' (ITEM_ID | IDENTIFIER | accessing)
    | 'coords' IDENTIFIER '=' (COORDS | IDENTIFIER | accessing)
    ;

However, it works fine if I replace '[]' with '{}' or '()'. For example, the following line of code

string() names = { "Bob", "Hans" }

works fine with the following grammar

 | 'string' '(' ')' IDENTIFIER '=' ('{' str_arr_items '}' | IDENTIFIER)

Why does it work with other kinds of brackets and symbols, but not with square brackets?

Edit

Here is the entire grammar

grammar Minecraft;

/* LEXER RULES */
SINGLE_COMMENT      : '//' ~('\r' | '\n')* -> skip ;
MULTILINE_COMMENT   : '/*' .*? '*/' -> skip ;
WS                  : [ \t\n\r]+ -> skip ;
fragment LETTER     : ('a' .. 'z') | ('A' .. 'Z') ;
IDENTIFIER          : LETTER+ ;
fragment NUMBER     : ('0' .. '9') ;
BOOL                : 'true' | 'false' ;
NUM                 : NUMBER+ | NUMBER+ '.' NUMBER+ ;
STR                 : '"' (LETTER | NUMBER)* '"' | '\'' (LETTER | NUMBER)* '\'' ;
COORDS              : NUM ',' NUM ',' NUM ;
ITEM_ID             : NUMBER+ | NUMBER+ ':' NUMBER+ ;
MULDIVMODOP         : '*' | '/' | '%' ;
ADDSUBOP            : '+' | '-' ;
NEGOP               : '!' ;
EQOP                : '==' | '!=' | '<' | '<=' | '>' | '>=' ;
LOGOP               : '&&' | '||' ;

/* PROGRAM GRAMMAR */

prog                : 'begin' 'bot' body 'end' 'bot' ;
body                : glob_var* initiate main function* ;
initiate            : 'initiate' stmt* 'end' 'initiate' ;
main                : 'loop' stmt* 'end' 'loop' ;
type                : 'num' | 'bool' | 'string' | 'block' | 'item' | 'coords' ;

function
    : 'function' IDENTIFIER '(' args ')' stmt* 'end' 'function'
    | 'activity' IDENTIFIER '(' args ')' stmt* 'end' 'activity'
    ;

arg
    : (type | arr_names) IDENTIFIER
    | dcl
    ;

args                : arg ',' args | arg ;
i_args              : IDENTIFIER ',' i_args | /* epsilon */ ;

cond
    : '(' cond ')'
    | left=cond MULDIVMODOP right=cond
    | left=cond ADDSUBOP right=cond
    | NEGOP cond
    | left=cond EQOP right=cond
    | left=cond LOGOP right=cond
    | (NUM | STR | BOOL | ITEM_ID | COORDS | IDENTIFIER)
    ;

stnd_stmt
    : dcl
    | 'for' IDENTIFIER '=' NUM ('to' | 'downto') NUM 'do' stmt* 'end' 'for'
    | ('while' | 'until') cond 'repeat' stmt* 'end' 'repeat'
    | IDENTIFIER '(' i_args ')'
    | 'break'
    ;

stmt                : stnd_stmt | if_stmt ;
else_stmt           : stnd_stmt | ifelse_stmt ;

if_stmt
    : 'if' cond 'then' stmt* 'end' 'if'
    | 'if' cond 'then' stmt* 'else' else_stmt* 'end' 'if'
    ;

ifelse_stmt
    : 'if' cond 'then' else_stmt*
    | 'if' cond 'then' else_stmt* 'else' else_stmt*
    ;

glob_var            : 'global' dcl ;


str_arr_items       : (STR | IDENTIFIER) ',' str_arr_items | (STR | IDENTIFIER) ;

dcl
    : 'num' IDENTIFIER '=' (NUM | IDENTIFIER | accessing)
    | 'bool' IDENTIFIER '=' (BOOL | IDENTIFIER | accessing)
    | 'string' '[' ']' IDENTIFIER '=' ('{' str_arr_items '}' | IDENTIFIER)
    | 'string' IDENTIFIER '=' (STR | IDENTIFIER | accessing)
    | 'block' IDENTIFIER '=' (ITEM_ID | IDENTIFIER | accessing)
    | 'item' IDENTIFIER '=' (ITEM_ID | IDENTIFIER | accessing)
    | 'coords' IDENTIFIER '=' (COORDS | IDENTIFIER | accessing)
    ;

arr_items        : 'num[]' | 'string[]' | 'block[]' | 'item[]' ;

accessing
    : IDENTIFIER '[' ('X' | 'Y' | 'Z') ']'
    | IDENTIFIER '[' NUM+ ']'
    ;

Solution

  • It turns out that the line

    arr_items        : 'num[]' | 'string[]' | 'block[]' | 'item[]' ;
    

    implicitly created the tokens

    num[] string[] block[] and item[]

    Because the lexer prefers the longest match, the input 'string[]' was turned into the single token 'string[]' and not the three tokens 'string', '[' and ']'. The dcl rule expects the three separate tokens, so the parser reported 'string[]' as extraneous input. When I deleted the line from the grammar, the parser behaved as expected. Thanks to Bart Kiers for pointing me towards this :)
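
The longest-match behaviour described above can be illustrated with a small Python sketch. This is a simplified model and not ANTLR's actual lexer: the tokenize function and the two literal lists are hypothetical, and they only mimic the rule that the longest matching token wins, with a literal winning a tie against an identifier.

```python
import re

# Hypothetical literal sets: the first mimics the grammar that still contains
# the 'string[]' literal, the second mimics the grammar after removing it.
WITH_ARR_LITERALS = ["string", "[", "]", "=", "string[]"]
WITHOUT_ARR_LITERALS = ["string", "[", "]", "="]


def tokenize(text, literals):
    """Split text into tokens, always taking the longest match at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace is skipped, like the WS rule
            continue
        # Candidate 1: the longest literal that matches here ("" if none does).
        best = max(
            (lit for lit in literals if text.startswith(lit, pos)),
            key=len,
            default="",
        )
        # Candidate 2: an identifier made of letters only (IDENTIFIER : LETTER+).
        ident = re.match(r"[A-Za-z]+", text[pos:])
        if ident and len(ident.group()) > len(best):
            best = ident.group()  # the identifier wins only if strictly longer
        if not best:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append(best)
        pos += len(best)
    return tokens


print(tokenize("string[] names", WITH_ARR_LITERALS))     # ['string[]', 'names']
print(tokenize("string[] names", WITHOUT_ARR_LITERALS))  # ['string', '[', ']', 'names']
```

With the 'string[]' literal present, the input never yields a separate 'string' token, so a rule that starts with 'string' '[' ']' cannot match. Without the literal, the same input splits into the three tokens that the dcl rule expects.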