Search code examples
parsingantlrrule

ANTLR: Grammar with uppercase rules won't recognize input


I'm quite new to ANTLR and I need an explanation about its behaviour when recognising input string. From what I understood, rules staring by an uppercase letter are lexer rules while the ones starting by lowercase letter are parser rules.
I have the following language and I need to write the grammar accordingly:
language
So L should accept all the strings composed by:

  • non empty string with b and/or c
  • d
  • non empty string with a and/or c

This is the first grammar I wrote and I don't have problems with it when recognising the input bbbcbcbcbcdaaacacacaaaa:

start   : (alfa 'd' beta);

alfa    : ('b'|'c') | ('b'|'c') alfa;
beta    : ('a'|'c') | ('a'|'c') beta;

WS      : (' '|'\n'|'\t'|'\r')->skip;

But if I change it like the following one, the same string is not recognised anymore (below you can see the debugger result of ANTLRWorks):

enter image description here

start   : (alfa 'd' beta);

alfa    : ALFA_VALS | ALFA_VALS alfa;
beta    : BETA_VALS | BETA_VALS beta;

ALFA_VALS: ('b'|'c');
BETA_VALS: ('a'|'c');
WS      : (' '|'\n'|'\t'|'\r')->skip;

Furthermore, if I change ALFA_VALS and BETA_VALS to alfa_vals and beta_vals respectively, no problems occur.

Could someone provide me an explanation about this behaviour? Because I couldn't find a specific one addressing this problem.
Thank you so much!


Solution

  • The ANTLR lexer matches the longest possible sub-sequence of input, or if it can match the same input using multiple lexer rules, it uses the first rule that matches.

    The lexer knows no context and no parser state, and decides based solely on the characters on input.

    If you define ALFA_VALS and BETA_VALS in lexer this way, the input 'c' will always be matched as ALFA_VALS token.