mismatched input ''.'' in ANTLR4


I am new to ANTLR4. My grammar contains the rule STRING: '\'' (~'\'' | '.' | '\\\'')* '\'' ;, but when I try to parse input containing '.' I get the error mismatched input ''.'' expecting {'(', ID, STRING}. Shouldn't the expected alternative STRING be able to match my '.'?

This problem only happens when the quoted character is a dot. I tried modifying my STRING rule and adding a rule that only parses '.', but parsing a dot between quotes doesn't seem possible from my point of view.

EDIT: as requested, here is my grammar:

grammar Antlr4Grammar;

grammarFile: 'grammar' grammarName=ID ';' (grammarRules+=rules)* baseRules* EOF;

rules: name=ID ':' ((body=ruleBody) | (children+=ID)) ('|' children+=ID)* ';';


ruleBody: (bodies+=terminalRuleBody)+ 'EOF'?;

terminalRuleBody:
  body=terminalRuleBody (op=operator)
  | parentRuleBody
  | stringRuleBody
  | affectRuleBody 
  ;

starOperator: '*';
          
plusOperator: '+';

questionMarkOperator: '?';

operator: starOperator | plusOperator | questionMarkOperator;

parentRuleBody:
    '(' body=ruleBody ')';

stringRuleBody:
    body=STRING ;

affectRuleBody:
    name=ID op=affectOp value=rOperand;

rOperand:
    val=ID | val='INT' | val='FLOAT' | val='CHAR' | val='STRING' | val='ID' ;

affectOp: eqOp | plusEqOp;

eqOp: '=';

plusEqOp: '+=';

baseRules:
    intBaseRule | floatBaseRule | charBaseRule | stringBaseRule | idBaseRule | wsBaseRule;

intBaseRule: 'INT' ':' '\'-\'?[0-9]+' ';';

floatBaseRule: 'FLOAT' ':' '\'-\'?[0-9]+' '\'.\'' + '[0-9]*' ';' ;

charBaseRule: 'CHAR' ':' '\'\\\'\'' '(\'\\\\\'|.)' '\'\\\'\'' ';' ;

stringBaseRule: 'STRING' ':' '\'"\'' '(\'\\\\\'|.)*?' '\'"\'' ';';

idBaseRule: 'ID' ':' '[a-zA-Z_]' '[a-zA-Z_0-9]*' ';' ;

wsBaseRule: 'WS' ':' '[ \\t\\r\\n]' '->' 'skip' ';' ;

ID: [a-zA-Z_] [a-zA-Z_0-9]*;
STRING: '\'' (~'\'' | '.' | '\\\'')* '\'' ;
WS: [ \t\r\n]+ -> skip;

And here is the input I am trying to parse:

grammar MiniJava;


program: mainClass=mainClass ( classDecl+=classDeclaration )*;

mainClass: 'class' name=ID '{' 'public' 'static' 'void' 'main' '(' 'String' '[' ']' argName=ID ')' '{' body=statement '}' '}';

classDeclaration: 'class' name=ID ('extends' parentClass=ID)? '{' (varDecl+=varDeclaration)* (methodDecl+=methodDeclaration)* '}';

varDeclaration: varType=type varName=ID ';';

methodDeclaration: 'public' returnType=type methodName=ID '(' (argType+=type argName+=ID (',' argType+=type argName+=ID)* )? ')' '{' (varDecl+=varDeclaration)* (body+=Statement)* 'return' returnExpr=expression ';' '}';

type: intArrayType | booleanType | intType | idType;

intArrayType: 'int' '[' ']';

booleanType: 'boolean';

intType: 'int';

idType: typeName=ID;

statement: compoundStatement | ifStatement | whileStatement | printStatement | affectStatement | affectArrayStatement;

compoundStatement: '{' (body+=statement)* '}';

ifStatement: 'if' '(' cond=expression ')' thenPart=statement 'else' elsePart=statement;

whileStatement: 'while' '(' cond=expression ')' body=statement;

printStatement: 'System.out.println' '(' printExpr=expression ')' ';';

affectStatement: lValue=ID '=' rValue=expression ';';

affectArrayStatement: array=ID '[' index=expression ']' '=' value=expression ';' ;

expression: binOpExpr;

binOpExpr: loperand=expression op=operator roperand=postfixExpression | postfixExpression;

operator: andOperator | lessThanOperator | plusOperator | minusOperator | multOperator;

andOperator: '&&';

lessThanOperator: '<';

plusOperator: '+';

minusOperator: '-';

multOperator: '*';

postfixExpression: baseExpr=unaryExpression operator=postfixOperator ;

postfixOperator: arrayIndexOperation | lengthOperation | methodCallOperation;

arrayIndexOperation:  '[' index=expression ']';

lengthOperation: '.' 'length';

methodCallOperation: '.' methodName=ID '(' (args+=expression (',' args+=expression)*)? ')';

unaryExpression: constIntExpression | trueExpression | falseExpression | varExpression | thisExpression | newExpression | notExpression | parentExpression;

constIntExpression: value=INT;

trueExpression: 'true';

falseExpression: 'false';

varExpression: varName=ID;

thisExpression: 'this';

newExpression: newClassExpression | newArrayExpression;

newClassExpression: 'new' name=ID '(' ')';
newArrayExpression: 'new' 'int' '[' size=expression ']';


notExpression: '!' baseExpr=expression;

parentExpression: '(' baseExpr=expression ')';

INT: '-'?[0-9]+;
FLOAT: '-'?[0-9]+ '.' [0-9]*;
CHAR: '\'' ('\\'|.) '\'';
STRING: '"' ('\\'|.)*? '"';
ID: [a-zA-Z_] [a-zA-Z_0-9]*;
WS: [ \t\r\n] -> skip ;
                                  

I didn't include it at first because I was worried it would be a little confusing: my grammar describes a subset of the ANTLR4 grammar syntax itself.


Solution

  • You must be careful when mixing literal tokens in a parser rule (the 'literal' in my example) and lexer rules that might also match what you defined as literal tokens (the ID rule below):

    parse
     : 'literal' ID
     ;
    
    ID
     : [a-zA-Z]+
     ;
    

    If you try to let the input "literal literal" be parsed by the parse rule, it will not work, even though the input "literal" could be matched by the ID rule.

    This is because ANTLR will translate my example as follows:

    parse
     : T__0 ID
     ;
    
    T__0
     : 'literal'
     ;
    
    ID
     : [a-zA-Z]+
     ;
    

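    The lexer produces exactly one token for a given piece of input, and when two lexer rules match the same text it picks the one defined first. The implicit T__0 rule is placed before ID, so "literal" is always tokenised as a T__0 token; it will never become an ID token.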

    This is what is happening in your grammar as well. In your floatBaseRule, you have defined the literal token '\'.\'', so the input '.' is always tokenised as that literal and never as the STRING that your stringRuleBody expects:

    floatBaseRule  : 'FLOAT' ':' '\'-\'?[0-9]+' '\'.\''+ '[0-9]*' ';' ;
    stringRuleBody : STRING ;
    

    You need to either replace '\'.\'' with stringRuleBody:

    floatBaseRule  : 'FLOAT' ':' '\'-\'?[0-9]+' stringRuleBody+ '[0-9]*' ';' ;
    stringRuleBody : STRING ;
    

    or let stringRuleBody also match the literal '\'.\'':

    floatBaseRule  : 'FLOAT' ':' '\'-\'?[0-9]+' '\'.\''+ '[0-9]*' ';' ;
    stringRuleBody : STRING | '\'.\'';
    

    IMO, the best option is to remove all these '...' literal tokens from your parser rules and move them into their own lexer rules.

    Also note that there is an ANTLR4 grammar that can parse itself: https://github.com/antlr/grammars-v4/tree/master/antlr/antlr4
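
    As a rough illustration of the advice about moving literals into their own lexer rules, here is a minimal sketch based on the small 'literal' / ID example from the start of this answer, not on your full grammar. The names Demo, LITERAL and identifier, as well as the WS rule, are made up for the example. Giving the literal a named lexer rule makes the overlap with ID visible, and a small parser rule can then accept the keyword wherever an identifier is allowed:

    ```antlr
    grammar Demo;

    // The former 'literal' token is now referenced by name.
    parse
     : LITERAL identifier EOF
     ;

    // Accept the keyword wherever an identifier is expected.
    identifier
     : ID
     | LITERAL
     ;

    // Defined before ID, so the text "literal" is always a LITERAL token.
    LITERAL
     : 'literal'
     ;

    ID
     : [a-zA-Z]+
     ;

    // Assumed for this sketch so that the space between words is skipped.
    WS
     : [ \t\r\n]+ -> skip
     ;
    ```

    With a grammar along these lines, the input "literal literal" should be accepted by parse: both words still become LITERAL tokens, but the identifier rule allows a LITERAL in the second position.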