How to tokenize a phrase split across multiple lines in ANTLR4


I want to tokenize the phrase "SINGULAR EXECUTIVE OF MINIMUM QUANTIA" when it is written across multiple lines. It is pretty simple when the full phrase is on one line:

foo bar foo bar foo bar SINGULAR EXECUTIVE OF MINIMUM QUANTIA foo bar foo bar foo bar foo bar
foo bar foo bar foo bar foo bar foo bar foo bar foo bar foo bar foo barfoo bar foo bar foo bar

but I cannot tokenize it when the phrase is split across two lines:

foo bar foo bar foo bar SINGULAR EXECUTIVE OF 
MINIMUM QUANTIA foo bar foo bar foo bar foo bar
foo bar foo bar foo bar foo bar foo bar foo bar foo bar foo bar 

This is my lexer

SPECIALWORD:S I N G U L A R ' ' E X E C U T I V E ' ' O F ' ' M I N I M U M ' ' Q U A N T I A;
fragment A:('a'|'A'|'á'|'Á');
......
......
fragment Z:('z'|'Z');
WORDUPPER: UCASE_LETTER UCASE_LETTER+;
WORDLOWER: LCASE_LETTER LCASE_LETTER+;
WORDCAPITALIZE: UCASE_LETTER LCASE_LETTER+;
LCASE_LETTER: 'a'..'z' | 'ñ' | 'á' | 'é' | 'í' | 'ó' | 'ú';
UCASE_LETTER: 'A'..'Z' | 'Ñ' | 'Á' | 'É' | 'Í' | 'Ó' | 'Ú';
INT: DIGIT+;
DIGIT: [0-9];  
WS : [ \t\r\n]+ -> skip;
ERROR: . ;

I have tried putting a line break into the lexer rule

SPECIALWORD:S I N G U L A R ' ' E X E C U T I V E ' ' O F [\n] M I N I M U M ' ' Q U A N T I A;

but it does not work. I guess this is because the lexer tokenizes line by line.


Solution

  • So what you actually want is for the five words in sequence to be recognized as one unit, with an arbitrary amount of whitespace between them. This is the default working principle of ANTLR4-based parsers. The lexer does not work line by line: it reads the whole input as a single character stream, so a line break is just another whitespace character. Your attempt to put everything into one single lexer token is what makes things complicated.

    Instead define your (key) words as:

    SINGULAR_SYMBOL: S I N G U L A R;
    EXECUTIVE_SYMBOL: E X E C U T I V E;
    OF_SYMBOL: O F;
    MINIMUM_SYMBOL: M I N I M U M;
    QUANTIA_SYMBOL: Q U A N T I A;

    and define a parser rule to parse these as a special sentence:

    singularExec: SINGULAR_SYMBOL EXECUTIVE_SYMBOL OF_SYMBOL MINIMUM_SYMBOL QUANTIA_SYMBOL;
    

    Together with your WS rule, which skips spaces, tabs, and line breaks, this matches any amount of whitespace between the individual symbols.
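
    Putting the pieces together, a minimal sketch of a combined grammar might look like the one below. The grammar name SingularExec, the simplified [aA]-style letter fragments, and the cut-down WORDUPPER and UCASE_LETTER rules are assumptions made for illustration; your real lexer has more rules and accent-aware fragments. One detail worth keeping in mind: the keyword rules are placed before WORDUPPER, because "SINGULAR" matches both rules with the same length, and on a tie ANTLR picks the rule that is defined first.

    ```antlr
    grammar SingularExec;   // hypothetical grammar name

    // Parser rule: the five keyword tokens in sequence. Whitespace,
    // including line breaks, never reaches the parser because the
    // lexer skips it (see WS below).
    singularExec
        : SINGULAR_SYMBOL EXECUTIVE_SYMBOL OF_SYMBOL MINIMUM_SYMBOL QUANTIA_SYMBOL
        ;

    // Keyword rules come first so they win the tie against WORDUPPER.
    SINGULAR_SYMBOL  : S I N G U L A R ;
    EXECUTIVE_SYMBOL : E X E C U T I V E ;
    OF_SYMBOL        : O F ;
    MINIMUM_SYMBOL   : M I N I M U M ;
    QUANTIA_SYMBOL   : Q U A N T I A ;

    // Simplified stand-ins for the generic word rules of the original lexer.
    WORDUPPER    : UCASE_LETTER UCASE_LETTER+ ;
    UCASE_LETTER : 'A'..'Z' ;

    // Spaces, tabs, and line breaks are all dropped before parsing.
    WS : [ \t\r\n]+ -> skip ;

    // Case-insensitive letter fragments (only the letters used above,
    // without the accented variants of the original).
    fragment A : [aA] ;
    fragment C : [cC] ;
    fragment E : [eE] ;
    fragment F : [fF] ;
    fragment G : [gG] ;
    fragment I : [iI] ;
    fragment L : [lL] ;
    fragment M : [mM] ;
    fragment N : [nN] ;
    fragment O : [oO] ;
    fragment Q : [qQ] ;
    fragment R : [rR] ;
    fragment S : [sS] ;
    fragment T : [tT] ;
    fragment U : [uU] ;
    fragment V : [vV] ;
    fragment X : [xX] ;
    ```

    Longer words that merely start with a keyword are not affected: for an input such as OFFER the lexer prefers the longest match, so WORDUPPER wins over OF_SYMBOL. To check the token stream yourself, you could feed the two-line sample to ANTLR's TestRig with the -tokens option and the singularExec start rule.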