I am new to ANTLR4 and wonder whether it can do what I am looking for. Here is an example input:
There is a lot of text
in this file that i do not care
about
Lithium 20 g/ml
Bor that should be skipped
Potassium 300g/ml
...
And here is my code:
SempredParser.g4:
parser grammar SempredParser;
options { tokenVocab=SempredLexer ;}
file : line+ EOF;
line : KEYWORD (NUM UNIT)+ '\n'+;
SempredLexer.g4:
lexer grammar SempredLexer;
//lexer rules
KEYWORD : ('Lithium' | 'Potassium' ) ;
NL : '\n';
NUM : [0-9]+ ('.'[0-9]+)? ;
UNIT : 'g/ml';
UNKNOWN : . -> skip ;
I would like to skip all lines that do not contain a KEYWORD (I have around 100 KEYWORDS). Note that I only use '\n' as a delimiter here and would ideally not have it appear in the parse output.
I read about island grammars in The Definitive ANTLR 4 Reference and also tried using lexer modes, but could not make it work that way. Any hints or help would be greatly appreciated.
You are pretty close; just avoid defining the linebreak token twice. This grammar works for me (I put it into a combined grammar file):
grammar IslandTest;
start: NL+ line+ EOF;
line: KEYWORD (NUM UNIT)+ NL+;
KEYWORD: ('Lithium' | 'Potassium');
NUM: [0-9]+ ('.' [0-9]+)?;
UNIT: 'g/ml';
NL: '\n';
UNKNOWN: . -> skip;
With your input that gives me this parse tree:
Note also: you cannot avoid the NL token in your output, because your line parser rule is line-based and therefore needs the newline token to know where a line ends.
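If the newline is only there to end a measurement line, one possible alternative is to let the lexer consume it, using the lexer modes mentioned in the question. The following is a minimal sketch under assumptions, not a drop-in replacement: the grammar names `IslandLexer` and `IslandParser`, the mode name `MEASUREMENT`, and the rule names `SEA`, `EOL` and `OTHER` are made up for illustration. Modes are only allowed in a lexer grammar, so this needs two files again:

```antlr
// IslandLexer.g4 (hypothetical file name)
lexer grammar IslandLexer;

// Default mode: the "sea". Only a keyword is interesting here;
// seeing one switches the lexer into MEASUREMENT mode.
KEYWORD : ('Lithium' | 'Potassium') -> pushMode(MEASUREMENT);
SEA     : . -> skip;   // every other character, including '\n'

// The "island": the rest of a line that started with a keyword.
mode MEASUREMENT;
NUM     : [0-9]+ ('.' [0-9]+)?;
UNIT    : 'g/ml';
EOL     : '\n' -> skip, popMode;   // end of line: drop it, return to the sea
OTHER   : . -> skip;               // spaces and anything else on the line

// IslandParser.g4 (hypothetical file name)
parser grammar IslandParser;
options { tokenVocab=IslandLexer; }

file : line* EOF;
line : KEYWORD (NUM UNIT)+;   // no newline token needed here
```

With this split, `NUM` and `UNIT` only exist inside the mode, so numbers or units on lines without a keyword are never tokenized, and no newline token reaches the parser. A keyword that is not followed by a number and unit on the same line would still produce a syntax error in `line`.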