
Antlr Lexer Quoted String Predicate


I'm trying to build a lexer to tokenize lone words and quoted strings. So far I have the following rules:

STRING:    QUOTE (options {greedy=false;} : . )* QUOTE ;
WS    :    SPACE+ { $channel = HIDDEN; } ;
WORD  :    ~(QUOTE|SPACE)+ ;

For the corner cases, it needs to parse:

"string" word1" word2

As three tokens: "string" as STRING, and word1" and word2 as WORD. Basically, if there is a trailing unmatched quote, it needs to be part of the WORD where it occurs. If that quote is surrounded by whitespace, it should be a WORD by itself.

I tried this rule for WORD, without success:

WORD:    ~(QUOTE|SPACE)+
    |    (~(QUOTE|SPACE)* QUOTE ~QUOTE*)=> ~(QUOTE|SPACE)* QUOTE ~(QUOTE|SPACE)* ; 

Solution

  • I finally found something that could do the trick without resorting to writing Java code:

        fragment QUOTE
                :   '"' ;
        fragment SPACE
                :   (' '|'\r'|'\t'|'\u000C'|'\n') ;
    
        WS      :   SPACE+ {$channel=HIDDEN;};
        PHRASE  :   QUOTE (options {greedy=false;} : . )* QUOTE ;
        WORD    :   (~(QUOTE|SPACE)* QUOTE ~QUOTE* EOF)=> ~(QUOTE|SPACE)* QUOTE ~(SPACE)*
                |   ~(QUOTE|SPACE)+ ;
    

    That way, the predicate on the first WORD alternative distinguishes it from both of the rules it could otherwise be confused with:

        PHRASE  :   QUOTE (options {greedy=false;} : . )* QUOTE ;
    

    and

                |   ~(QUOTE|SPACE)+ ;
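
To sanity-check the intended behaviour outside ANTLR, the three rules can be mimicked with a small hand-written scanner. The following is only a minimal Python sketch of what the grammar is meant to produce, not ANTLR output and not a description of how ANTLR resolves predicates internally. The function name `tokenize` and the `(type, text)` tuple format are assumptions made for illustration.

```python
# Characters matched by the SPACE fragment: ' ', '\r', '\t', '\u000C', '\n'
SPACE = " \r\t\f\n"
QUOTE = '"'


def tokenize(text):
    """Hypothetical hand-written equivalent of the WS / PHRASE / WORD rules."""
    tokens = []
    pos = 0
    n = len(text)
    while pos < n:
        ch = text[pos]
        if ch in SPACE:
            # WS goes to the hidden channel, so it is simply skipped here.
            pos += 1
            continue

        # Scan the leading run of ~(QUOTE|SPACE) characters (may be empty).
        end = pos
        while end < n and text[end] != QUOTE and text[end] not in SPACE:
            end += 1

        # Models the predicate: the run is followed by a quote, and no
        # further quote appears before the end of the input.
        if end < n and text[end] == QUOTE and QUOTE not in text[end + 1:]:
            end += 1  # consume the quote itself
            while end < n and text[end] not in SPACE:  # ~(SPACE)*
                end += 1
            tokens.append(("WORD", text[pos:end]))
        elif ch == QUOTE:
            # The predicate failed at a quote, so a closing quote must exist.
            end = text.index(QUOTE, pos + 1) + 1
            tokens.append(("PHRASE", text[pos:end]))
        else:
            # Plain word: ~(QUOTE|SPACE)+
            tokens.append(("WORD", text[pos:end]))
        pos = end
    return tokens
```

For the input from the question, `tokenize('"string" word1" word2')` yields a PHRASE for `"string"` followed by WORD tokens for `word1"` and `word2`. A lone quote surrounded by whitespace, as in `a " b`, comes out as a WORD of its own.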