
antlr error message seems self-contradictory? What am I doing wrong?


I have the following ANTLR4 grammar in the file xbf.g4:

grammar xbf;

prog: ( struct_def )* ;

type:  ( type_uint | type_float | type_string ) ;

type_uint: U8 | U16 | U32 | U64 ;
type_float: F32 | F64 ;
type_string: '"' STRLIT? '"' ;

struct_def: STRUCT IDENT '{' ( member )* '}' ;
member: IDENT ':' type ',' ;

STRUCT: 'struct' ;
IDENT: [a-zA-Zα-ωΑ-ΩА-Яа-я][a-zA-Z0-9_α-ωΑ-ΩА-Яа-я]* ;

U8: 'u8' ;
U16: 'u16' ;
U32: 'u32' ;
U64: 'u64' ;

F32: 'f32' ;
F64: 'f64' ;

STRLIT: '"' (~[\r\n"] | '\\"')* '"' ;
WS: [ \t\n]+ -> skip;

The program parses the following file:

struct vec3d {
  x : f32,
  y : f32,
  z : f32,
}

The error messages are:

line 2:6 mismatched input 'f32' expecting {'"', 'u8', 'u16', 'u32', 'u64', 'f32', 'f64'}
line 3:6 mismatched input 'f32' expecting {'"', 'u8', 'u16', 'u32', 'u64', 'f32', 'f64'}
line 4:6 mismatched input 'f32' expecting {'"', 'u8', 'u16', 'u32', 'u64', 'f32', 'f64'}

So it expects one of several symbols, including f32, and it finds f32, yet it reports an error?


Solution

  • The lexer creates tokens using the following 2 rules:

    1. try to match as many characters as possible
    2. when 2 (or more) rules match the same characters, let the one defined first "win"

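    Because of rule 2, the input f32 is tokenized as an IDENT token, not as an F32 token: both rules match the same three characters, and IDENT is defined first. The error message prints the text of the offending token rather than its type, which is why it appears to contradict itself. The solution: move your IDENT rule below all the U... and F... rules: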

    STRUCT: 'struct' ;
    
    U8: 'u8' ;
    U16: 'u16' ;
    U32: 'u32' ;
    U64: 'u64' ;
    
    F32: 'f32' ;
    F64: 'f64' ;
    
    IDENT: [a-zA-Zα-ωΑ-ΩА-Яа-я][a-zA-Z0-9_α-ωΑ-ΩА-Яа-я]* ;
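To see the two lexer rules in action without running ANTLR, here is a minimal Python sketch that imitates them with regular expressions. It is not how ANTLR is implemented; the `tokenize` function, the `PUNCT` rule, and the ASCII-only `IDENT` pattern are assumptions made for illustration.

```python
import re

def tokenize(text, rules):
    """Imitate the two lexer rules: longest match wins, ties go to the first rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater: a later rule with an equally long match
            # never displaces an earlier one (rule 2).
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # whitespace is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Hypothetical, simplified stand-ins for the grammar's lexer rules.
IDENT = ("IDENT", r"[a-zA-Z][a-zA-Z0-9_]*")
F32 = ("F32", r"f32")
PUNCT = ("PUNCT", r"[:,{}]")
WS = ("WS", r"[ \t\n]+")

ident_first = [IDENT, F32, PUNCT, WS]    # order used in the question
keyword_first = [F32, IDENT, PUNCT, WS]  # order used in the fix

print(tokenize("x : f32,", ident_first))    # f32 comes out as IDENT
print(tokenize("x : f32,", keyword_first))  # f32 comes out as F32
print(tokenize("f32x", keyword_first))      # longest match: a single IDENT
```

With `IDENT` listed first, `f32` is reported as an `IDENT` even though its text is `f32`, which is the situation behind the confusing error message. Moving the keyword rule first changes only the tie-break; a longer name such as `f32x` is still a single `IDENT` because of rule 1.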