I have the following token rules:
IF: 'IF' | 'if';
THEN: 'THEN' | 'then';
ELSE: 'ELSE' | 'else';
BINARYOPERATOR: 'AND' | 'and' | 'OR' | 'or';
NOT: 'NOT' | 'not';
WORD: (DIGIT* (LOWERCASE | UPPERCASE | WORDSYMBOL)) (LOWERCASE | UPPERCASE | DIGIT | WORDSYMBOL)*;
This works, but something like "my variable" comes out as WORD WORD. I want to be able to have just the one token, which represents the whole thing.
I changed it to:
IF: 'IF' | 'if';
THEN: 'THEN' | 'then';
ELSE: 'ELSE' | 'else';
BINARYOPERATOR: 'AND' | 'and' | 'OR' | 'or';
NOT: 'NOT' | 'not';
WORD: (LOWERCASE | UPPERCASE | WORDSYMBOL)+ (' '* (LOWERCASE | UPPERCASE | WORDSYMBOL))*;
This fixed that; however, it also captures character strings that I'd like classified as keyword tokens, as above.
For example, "if my variable then something" shouldn't be a single WORD token; it should be IF WORD THEN WORD.
I understand why it's being tokenized as it is (tokens consuming more of the input are preferred), but am not sure how to change the behaviour.
Unfortunately (for what you'd like to do), that's not how ANTLR's Tokenization works.
(This is more a "logical" explanation rather than the actual implementation)
When ANTLR evaluates lexer rules, it attempts to match each rule against the characters in your input stream, beginning at your current position in that stream.
Once it has all of the input sequences that match, if one sequence is longer than the rest, it chooses the token type that produces that longest token. This is why your WORD rule consumes input until it finds a character that cannot be part of a WORD (and that includes "slurping up" keywords if they match the WORD pattern).
(For completeness) If the Tokenizer finds more than one equal length match, the 1st rule that matches in your grammar will be the Token type assigned.
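The selection rule described above can be sketched in a few lines of Python. This is only a simplified model of the "longest match wins, earliest rule breaks ties" logic, not ANTLR's actual implementation. The regexes are assumptions: LOWERCASE, UPPERCASE and WORDSYMBOL are approximated as [A-Za-z_] because their definitions are not shown in the question, and whitespace is skipped as a hypothetical WS -> skip rule would do.

```python
import re

# Keyword rules, in the same order as in the question's grammar.
KEYWORDS = [
    ("IF", re.compile(r"IF|if")),
    ("THEN", re.compile(r"THEN|then")),
    ("ELSE", re.compile(r"ELSE|else")),
    ("BINARYOPERATOR", re.compile(r"AND|and|OR|or")),
    ("NOT", re.compile(r"NOT|not")),
]

# Assumption: WORDSYMBOL is modelled as "_" (its definition is not shown).
# First grammar: WORD has no embedded spaces.
SIMPLE_RULES = KEYWORDS + [("WORD", re.compile(r"[0-9]*[A-Za-z_][A-Za-z0-9_]*"))]
# Second grammar: WORD may contain spaces between its characters.
GREEDY_RULES = KEYWORDS + [("WORD", re.compile(r"[A-Za-z_]+(?: *[A-Za-z_])*"))]


def tokenize(text, rules):
    """Pick the longest match at each position; ties go to the earliest rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # assumption: whitespace between tokens is skipped
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = pattern.match(text, pos)
            # Strictly longer wins, so on equal length the earlier rule is kept.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    source = "if my variable then something"
    print(tokenize(source, SIMPLE_RULES))
    print(tokenize(source, GREEDY_RULES))
```

With the first WORD rule, "if" and "then" match both the keyword rule and WORD at the same length, so the keyword rule listed first is chosen. With the second WORD rule, WORD matches the entire input from position 0, which is longer than any keyword match, so the keywords never get a chance.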
You might have success with the following approach:
Assumption: a WORD cannot be one of your language keywords.
- Make sure the WORD rule is after all of your keyword rules so that they'll take priority.
- Add a parser rule: word: WORD+;
- Use the word parser rule everywhere you would have used the WORD token.
- In a listener, use enterWord() to merge all the WORDs into a single "word". (You could handle this step several ways, but this is one fairly simple approach.)
Caveats: