I am trying to use ANTLR 4 to build a parser. Ideally, I want to put my parser and lexer grammars in separate files (because I'll be adding more grammars later that will use some of these common tokens).
As an example, here's a very simple input that my parser should accept:
<|
TITLE: Test Title
|>
Now, when I combine my lexer and parser rules in one file (called PreambleCondensed.g4), I have no token recognition issues:
grammar PreambleCondensed;
preamble: PREAMBLE_OPENER preambleField+ PREAMBLE_CLOSER;
preambleField: titleDef
| composerDef
| tsDef
| subdivDef
;
titleDef: KW_TITLE PREAMBLE_SEP NAME NEWLINE ; // give the groove a title
composerDef: KW_COMPOSER PREAMBLE_SEP NAME NEWLINE ; // name the groove's composer
tsDef: KW_TS PREAMBLE_SEP timeSignature NEWLINE ; // specify the time signature
subdivDef: KW_SUBDIV PREAMBLE_SEP subdivs=INT NEWLINE ;
timeSignature: INT '/' INT ;
PREAMBLE_OPENER: '<|' NEWLINE;
WS: [ \t]+ -> skip ;
INT: [0-9]+ ;
NEWLINE: '\r'? '\n' ;
PREAMBLE_SEP : ':' ;
KW_TITLE: 'TITLE' ;
KW_COMPOSER: 'COMPOSER' ;
KW_TS: 'TS' | 'TIME SIGNATURE' ;
KW_SUBDIV: 'SUBDIVS' | 'SUBDIVBY' | 'N' ;
NAME: [a-zA-Z] [a-zA-Z ]* ;
PREAMBLE_CLOSER: '|>' ;
However, when I split them into separate files:
PreambleLex.g4:
lexer grammar PreambleLex;
PREAMBLE_OPENER: '<|' -> mode(PREAMBLE);
WS: [ \t]+ -> skip ;
INT: [0-9]+ ;
NEWLINE: '\r'? '\n' ;
mode PREAMBLE;
KW_TITLE: 'TITLE:' WS* ;
KW_COMPOSER: 'COMPOSER:' WS* ;
KW_TS: ( 'TS' | 'TIME SIGNATURE' ) ':' WS* ;
KW_SUBDIV: ( 'SUBDIVS' | 'SUBDIVBY' | 'N' ) ':' WS* ;
NAME: ~[\r\n]+ ;
PREAMBLE_CLOSER: '|>' -> mode(DEFAULT_MODE) ;
And Preamble.g4:
grammar Preamble;
options {tokenVocab = PreambleLex;}
preamble: PREAMBLE_OPENER preambleField+ PREAMBLE_CLOSER;
preambleField: titleDef
| composerDef
| tsDef
| subdivDef
;
titleDef: KW_TITLE NAME NEWLINE ; // give the groove a title
composerDef: KW_COMPOSER NAME NEWLINE ; // name the groove's composer
tsDef: KW_TS timeSignature NEWLINE ; // specify the time signature
subdivDef: ( KW_SUBDIV subdivs=INT NEWLINE ) ;
timeSignature: INT '/' INT ;
I build the lexer:
antlr4 PreambleLex.g4 -o testable/preamble
Then build the parser:
antlr4 Preamble.g4 -lib testable/preamble -o testable/preamble -no-listener -no-visitor
javac testable/preamble/*.java
When I attempt the same input with this, I get:
line 1:0 token recognition error at: '<'
line 1:1 token recognition error at: '|'
line 1:2 token recognition error at: '\n'
line 2:0 token recognition error at: '\t'
line 2:1 token recognition error at: 'T'
line 2:2 token recognition error at: 'I'
line 2:3 token recognition error at: 'T'
line 2:4 token recognition error at: 'L'
line 2:5 token recognition error at: 'E'
line 2:6 token recognition error at: ':'
line 2:7 token recognition error at: ' '
line 2:8 token recognition error at: 'T'
line 2:9 token recognition error at: 'e'
line 2:10 token recognition error at: 's'
line 2:11 token recognition error at: 't'
line 2:12 token recognition error at: ' '
line 2:13 token recognition error at: 'T'
line 2:14 token recognition error at: 'i'
line 2:15 token recognition error at: 't'
line 2:16 token recognition error at: 'l'
line 2:17 token recognition error at: 'e'
line 2:18 token recognition error at: '\n'
line 3:0 token recognition error at: '|'
line 3:1 token recognition error at: '>'
line 3:2 token recognition error at: '\n'
line 4:0 mismatched input '<EOF>' expecting '<|'
The split version uses the lexical modes feature, but I have also tried it without modes and I run into the same issue.
Because everything works as expected when I put the lexer and parser rules in the same file, I suspect there's something wrong with the way I'm generating the files, or the way I'm telling the parser which lexer grammar to use. Based on what I've seen in the ANTLR 4 book and other examples, I thought I was doing everything correctly, but obviously not. What is wrong here?
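As written, grammar Preamble; declares a combined grammar, so ANTLR generates a separate lexer from it instead of relying on the rules in PreambleLex, which would explain why almost none of the input is recognized. The first line of Preamble.g4 needs to specify that it is a parser grammar: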
parser grammar Preamble;
Once you do this, it'll pull in your lexer, but ANTLR will then reject this line:
timeSignature: INT '/' INT ;
The error is:
cannot create implicit token for string literal in non-combined grammar: '/'
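A parser grammar cannot create tokens implicitly from string literals, so you'll need to define '/' as a lexer rule in PreambleLex.g4 and refer to it by name in the parser rule. There are still more issues you'll need to resolve, but that should address your immediate problem.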
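As a rough sketch of the two changes described above, the excerpts below show one way to give '/' a named token and to declare the parser grammar. The token name SLASH is an assumption chosen for illustration and does not appear in the original grammars. Which mode the rule belongs in depends on which mode is active when the lexer reaches a time signature, so the placement shown is only a starting point.

```antlr
// PreambleLex.g4 (excerpt)
lexer grammar PreambleLex;

PREAMBLE_OPENER: '<|' -> mode(PREAMBLE);
WS: [ \t]+ -> skip ;
INT: [0-9]+ ;
SLASH: '/' ;              // hypothetical name: explicit token for the '/' literal
NEWLINE: '\r'? '\n' ;

// ... mode PREAMBLE and its rules unchanged ...
```

```antlr
// Preamble.g4 (excerpt)
parser grammar Preamble;                 // was: grammar Preamble;
options { tokenVocab = PreambleLex; }    // take token types from the lexer grammar

// ... other parser rules unchanged ...

timeSignature: INT SLASH INT ;           // refer to the token by name, not by literal
```

Keep in mind that only the rules of the current mode are active while lexing, so rules declared in the default mode (INT, NEWLINE, and SLASH as placed here) are not matched while the lexer is in PREAMBLE mode. That is likely to be among the remaining issues to sort out.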