
match words with spaces as one token but disallow certain keyword tokens


I have the following token rules:

IF: 'IF' | 'if';
THEN: 'THEN' | 'then';
ELSE: 'ELSE' | 'else';
BINARYOPERATOR: 'AND' | 'and' | 'OR' | 'or';
NOT: 'NOT' | 'not';

WORD: (DIGIT* (LOWERCASE | UPPERCASE | WORDSYMBOL)) (LOWERCASE | UPPERCASE | DIGIT | WORDSYMBOL)*;

This works, but something like my variable comes out as WORD WORD. I want a single token that represents the whole thing.

I changed it to:


IF: 'IF' | 'if';
THEN: 'THEN' | 'then';
ELSE: 'ELSE' | 'else';
BINARYOPERATOR: 'AND' | 'and' | 'OR' | 'or';
NOT: 'NOT' | 'not';

WORD: (LOWERCASE | UPPERCASE | WORDSYMBOL)+ (' '* (LOWERCASE | UPPERCASE | WORDSYMBOL))*;

This fixed that, however it also captures character strings that I'd like classified as a keyword token as above.

For example, if my variable then something shouldn't be a single WORD token; it should be IF WORD THEN WORD.

I understand why it's being tokenized as it is (tokens consuming more of the input are preferred), but am not sure how to change the behaviour.


Solution

  • Unfortunately (for what you'd like to do), that's not how ANTLR's Tokenization works.

    (This is more a "logical" explanation rather than the actual implementation)

    When ANTLR is evaluating Lexer rules, it will attempt to match each rule against the characters in your input stream, beginning at your current position in that input stream.

    Once it has all of the input sequences that match, if one sequence is longer than the rest, it will choose the Token type that produces the longest token. This is where your WORD rule is going to consume input until it finds something that doesn't match as a character in a WORD (and that will include "slurping up" keywords if they match the WORD pattern).

    (For completeness) If the Tokenizer finds more than one match of the same longest length, the rule that appears first in your grammar determines the Token type assigned.
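The two rules above (longest match wins, ties go to the earliest rule) can be illustrated with a small simulation. This is only a sketch in plain Python, not ANTLR's actual implementation: the `tokenize` helper and both rule tables are hypothetical, and the character class `[A-Za-z_]` is an assumption, since LOWERCASE, UPPERCASE and WORDSYMBOL are not defined in the question (the optional leading digits are left out as well).

```python
import re

# Keyword rules in grammar order (hypothetical stand-ins for the lexer rules).
KEYWORDS = [
    ("IF", re.compile(r"IF|if")),
    ("THEN", re.compile(r"THEN|then")),
    ("ELSE", re.compile(r"ELSE|else")),
    ("BINARYOPERATOR", re.compile(r"AND|and|OR|or")),
    ("NOT", re.compile(r"NOT|not")),
]

# WORD without spaces (first grammar) and WORD with embedded spaces (second grammar).
# The [A-Za-z_] class is an assumption made for this sketch.
SINGLE_WORD_RULES = KEYWORDS + [("WORD", re.compile(r"[A-Za-z_]+"))]
SPACED_WORD_RULES = KEYWORDS + [("WORD", re.compile(r"[A-Za-z_]+(?: *[A-Za-z_])*"))]


def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos] == " ":  # skip whitespace between tokens
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = pattern.match(text, pos)
            # Strictly longer replaces the current best, so ties keep the earlier rule.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

With `SPACED_WORD_RULES`, the input `if my variable then something` comes back as one WORD token, because WORD matches far more characters than IF does. With `SINGLE_WORD_RULES`, IF and WORD both match two characters at the start, so the tie goes to IF, which is listed first.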


    You might have success with the following approach:

    Assumption: WORD cannot be one of your language keywords

    • make sure that the WORD rule is after all of your keyword rules so that they'll take priority.
    • add a Parser rule word: WORD+;
    • now use the word parser rule everywhere you would have used the WORD token.
    • Write a Listener that overrides enterWord() and merge all the WORDs into a single "word". (You could handle this step several ways, but this is one, fairly simple, approach)
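The merging step in the last bullet can be sketched independently of ANTLR. The example below is an assumption-laden illustration, not generated ANTLR code: `merge_words` is a hypothetical helper that works on plain `(type, text)` pairs, and it joins words with a single space, which assumes the original spacing does not need to be preserved. In a real listener, the same joining would most likely be done inside `enterWord()` over the WORD terminals of the rule context.

```python
def merge_words(tokens):
    """Collapse each run of consecutive WORD tokens into one WORD token.

    `tokens` is a list of (type, text) pairs, e.g. the output of a lexer
    whose WORD rule does not allow spaces.
    """
    merged = []
    for name, text in tokens:
        if name == "WORD" and merged and merged[-1][0] == "WORD":
            # Extend the previous WORD instead of starting a new one.
            merged[-1] = ("WORD", merged[-1][1] + " " + text)
        else:
            merged.append((name, text))
    return merged


tokens = [
    ("IF", "if"),
    ("WORD", "my"),
    ("WORD", "variable"),
    ("THEN", "then"),
    ("WORD", "something"),
]
result = merge_words(tokens)
```

Here `result` has the shape IF WORD THEN WORD, with `my variable` as a single WORD, which is the tokenization the question asks for. Keywords are never absorbed because they break the run of WORD tokens.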

    caveats:

    • There's a reason that languages do not typically allow for this. I suspect you'll encounter other complications/ambiguities down the road.
    • Performance MAY be impacted as ANTLR has to do more look-ahead to know when to backtrack.