
What's the best way to handle optional tokens in ANTLR4?


Suppose I have the following input:

Great University
Graduated in 2010
Some University
09/2009 - 06/2011
Nice University
06/2011

I want to parse the years of study. My grammar looks like this:

education:
    (section)*
    EOF
    ;

section:
    (school | years)+
    ;

degree:     WORD* DEGREE WORD* SEPARATOR;
years:      WORD* ( (YEAR_START '-')? YEAR_END) WORD* SEPARATOR;
WS          : [ \t\r]+ -> skip;
SEPARATOR   : (NEWLINE | COMMA);
COMMA       : ',';
NEWLINE     : '\n';
SCHOOL      : ('university' | 'University' | 'school' | 'School');
WORD        : [a-zA-Z'()]+;
YEAR_START  : YEAR;
YEAR_END    : YEAR;
YEAR        : (DIGIT DIGIT '/')? [1-2] DIGIT DIGIT DIGIT;
DIGIT       : [0-9];

I'm getting the following errors:

line 1:17 mismatched input '\n' expecting '-'
line 6:17 mismatched input '\n' expecting '-'

How can I handle the optional start year in the grammar?


Solution

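  • The lexer assigns exactly one token type to each piece of input it matches. You expect it to match a year against three token types and to decide at runtime which one is correct. This is not how ANTLR works: the lexer runs without knowledge of the parser, and when several lexer rules match the same text, the rule defined first wins.

    In your case every year (not only the optional one) is matched by the first of those rules, i.e. YEAR_START, so YEAR_END and YEAR are never produced. This gives the following tokenization: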

    "Graduated in 2010" -> WORD WORD YEAR_START
    

    The only parser rule that could match is

     years:      WORD* ( (YEAR_START '-')? YEAR_END) WORD* SEPARATOR;
    

    but after YEAR_START the parser expects '-', which is missing, hence the error.

    The grammar should work if you delete the YEAR_START and YEAR_END rules and replace every occurrence of them with YEAR. YEAR_START and YEAR_END were presumably meant to distinguish the start year from the end year, but that is what labels in a parser rule are for.

    If this does not work, please post your complete grammar; the one you posted is missing, for example, a rule for DEGREE.
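
    The fix described above might look roughly like the sketch below. It is not the asker's full grammar: the grammar name Education, the school rule, and the label names startYear and endYear are assumptions made for illustration, DIGIT is turned into a fragment so it can no longer become a token of its own, and the unused degree rule is left out because DEGREE is not defined in the post. The sketch also assumes that every line of the input, including the last one, ends with a newline or a comma.

    ```antlr
    grammar Education;

    education : section* EOF ;

    section   : (school | years)+ ;

    // Hypothetical rule: the posted grammar references `school` but never defines it.
    school    : WORD* SCHOOL WORD* SEPARATOR ;

    // One YEAR token type; the labels tell the start year from the end year.
    years     : WORD* (startYear=YEAR '-')? endYear=YEAR WORD* SEPARATOR ;

    SEPARATOR : '\n' | ',' ;
    SCHOOL    : 'university' | 'University' | 'school' | 'School' ;
    WORD      : [a-zA-Z'()]+ ;
    YEAR      : (DIGIT DIGIT '/')? [1-2] DIGIT DIGIT DIGIT ;
    WS        : [ \t\r]+ -> skip ;

    // A fragment only helps build other lexer rules and never yields a token.
    fragment DIGIT : [0-9] ;
    ```

    With this, "Graduated in 2010" tokenizes as WORD WORD YEAR, and the parser skips the optional (startYear=YEAR '-') part because the YEAR is not followed by '-'. In the generated Java parser the context object for years exposes startYear and endYear as Token fields; startYear is null when only one year is present.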