antlr4 intellij-plugin antlrworks

Is it possible to have tokens that use tokens in ANTLR4?


I'm new to ANTLR and I'm trying to learn. I have a lexer with defined tokens, and another token that uses a subset of those tokens, like so:

ADDQ: 'addq';
SUBQ: 'subq';
ANDQ: 'andq';
XORQ: 'xorq';
OP: (ADDQ | ANDQ | XORQ | SUBQ);

In my parser I have a rule called doOperation, like so:

doOperation:
    OP REGISTER COMMA REGISTER;

When I test the rule using IntelliJ's ANTLR plugin with the example subq %rax, %rcx, I get an error that says, "mismatched input at subq, expect OP". What is the correct way to do this?


Solution

  • You can use token rules inside of other token rules, but when you do, there should be additional text that's matched around it. Something like:

    A: 'abc';
    B: A 'def';
    

    Given these rules the string "abc" would produce an A token and "abcdef" would produce a B token.

    However, when you define one rule as an alternative of other rules like you did, you end up with multiple lexical rules that could match the same input. When lexical rules overlap, ANTLR (just like the vast majority of lexer generators) will first pick the rule that would lead to the longest match and, in case of ties, pick the one that appears first in the grammar.
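
    To make the two tie-breaking steps concrete, here is a minimal sketch using a keyword and an identifier rule. The grammar name and the IF, ID and WS rules are made up for illustration and are not part of your grammar:

    ```antlr
    lexer grammar TieBreakDemo;   // hypothetical name, for illustration only

    IF : 'if' ;                   // keyword rule, listed first
    ID : [a-z]+ ;                 // general rule that also matches "if"
    WS : [ \t\r\n]+ -> skip ;     // ignore whitespace between tokens
    ```

    For the input if, both IF and ID match two characters, so the tie goes to IF because it is listed first. For the input iffy, ID matches four characters while IF matches only two, so the longer match wins and you get an ID token.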

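    So given your rules, the input addq would produce an ADDQ token: both ADDQ and OP match the same four characters, and ADDQ appears before OP in the grammar. The same goes for SUBQ and the others, so there's no way an OP token would ever be generated. That is why your parser, which expects OP, reports a mismatched input at subq: what it actually receives is a SUBQ token.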

    Since you said that you don't use ADDQ, SUBQ etc. in your parser rules, you can make them fragments instead of token rules. Fragments can be used in token rules, but aren't themselves tokens. So you'll never end up with a SUBQ token, because SUBQ isn't a token; you can only get OP tokens. In fact, you don't even have to give them names at all; you could just "inline" them into OP like this:

    OP: 'addq' | 'subq' | 'andq' | 'xorq' ;
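
    For comparison, the fragment variant mentioned above could be sketched as follows. It reuses the rule names from your question and only adds the fragment keyword:

    ```antlr
    // Fragments are building blocks for lexer rules; they never become tokens.
    fragment ADDQ : 'addq' ;
    fragment SUBQ : 'subq' ;
    fragment ANDQ : 'andq' ;
    fragment XORQ : 'xorq' ;

    // OP is now the only rule that can match these four strings.
    OP : ADDQ | SUBQ | ANDQ | XORQ ;
    ```

    Either form should leave OP as the only token produced for these inputs, so doOperation can stay as it is.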
    

    Another option (one that you'd have to use if you were using SUBQ etc. directly) is to turn OP into a parser rule instead of a token. Parser rule names must start with a lowercase letter, so it would be called op, and doOperation would refer to op instead of OP. That way the input subq would still generate a SUBQ token, but that would be okay because now the op rule would accept a SUBQ token.
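
    A rough sketch of that parser-rule variant is shown below. Your question doesn't show the REGISTER and COMMA rules or any whitespace handling, so the grammar name and the REGISTER, COMMA and WS definitions here are assumptions; substitute your own:

    ```antlr
    grammar Ops;                    // hypothetical grammar name

    doOperation : op REGISTER COMMA REGISTER ;
    op          : ADDQ | SUBQ | ANDQ | XORQ ;   // parser rule, accepts any of the four tokens

    ADDQ : 'addq' ;
    SUBQ : 'subq' ;
    ANDQ : 'andq' ;
    XORQ : 'xorq' ;

    REGISTER : '%' [a-z0-9]+ ;      // assumed definition, e.g. %rax
    COMMA    : ',' ;                // assumed definition
    WS       : [ \t\r\n]+ -> skip ; // assumed: whitespace is skipped
    ```

    With a grammar along these lines, subq %rax, %rcx should lex as SUBQ REGISTER COMMA REGISTER, which doOperation accepts through op.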