
Support optional quotes in a Boolean expression


Background

I have been using ANTLRWorks (v1.4.3) for a few days and am trying to write a simple Boolean parser. The combined lexer/parser grammar below meets most of the requirements, including support for quoted text containing whitespace as an operand of a Boolean expression.


Problem

I would like the grammar to also accept operands that contain whitespace without requiring quotes.


Example

For example, the expression

"left right" AND center

should produce the same parse tree after the quotes are dropped:

left right AND center

I have been reading about backtracking, predicates, etc., but can't seem to find a solution.


Code

Below is the grammar I have so far. Any feedback on foolish mistakes is appreciated :).

Lexer/Parser Grammar

grammar boolean_expr;

options {
    TokenLabelType=CommonToken;
    output=AST;
    ASTLabelType=CommonTree;
}

@modifier{public}
@ctorModifier{public}

@lexer::namespace{Org.CSharp.Parsers}
@parser::namespace{Org.CSharp.Parsers}

public
evaluator 
    : expr EOF 
    ;

public
expr 
    : orexpr 
    ;

public  
orexpr 
    : andexpr (OR^ andexpr)* 
    ;

public
andexpr 
    : notexpr (AND^ notexpr)* 
    ;

public
notexpr 
    : (NOT^)? atom 
    ;

public
atom 
    : word | LPAREN! expr RPAREN! 
    ;

public
word  
    :  QUOTED_TEXT | TEXT 
    ;

/*
 * Lexer Rules
 */

LPAREN 
    : '(' 
    ;

RPAREN 
    : ')' 
    ;

AND 
    : 'AND'
    ;

OR 
    : 'OR'
    ;

NOT
    : 'NOT'
    ;

WS 
    :  ( ' ' | '\t' | '\r' | '\n')  {$channel=HIDDEN;}  
    ;

QUOTED_TEXT  
    : '"' (LETTER | DIGIT | ' ' | ',' | '-')+ '"'
    ;

TEXT
    : (LETTER | DIGIT)+ 
    ;

/*
Fragment lexer rules can be used by other lexer rules, but do not return tokens by themselves
*/
fragment DIGIT  
    :   ('0'..'9') 
    ; 

fragment LOWER  
    :   ('a'..'z') 
    ; 

fragment UPPER  
    :   ('A'..'Z') 
    ; 

fragment LETTER 
    :   LOWER | UPPER 
    ; 

Solution

  • Simply let TEXT in your word rule match one or more times: TEXT+. Because the rule can then match several TEXT tokens in a row, you'll also want a custom root node to group them (I added an imaginary token called WORD in the grammar below).

    grammar boolean_expr;
    
    options {
        output=AST;
    }
    
    tokens {
      WORD;
    }
    
    evaluator 
        : expr EOF 
        ;
    
    ...
    
    word  
        : QUOTED_TEXT 
        | TEXT+       -> ^(WORD TEXT+)
        ;
    
    ...
    

    Your input left right AND center (without quotes) would now be parsed as follows:

    [image: the resulting parse tree]
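
To make the shape of the resulting tree concrete in text form, here is a rough, hand-written Python approximation of the modified grammar. It is not code generated by ANTLR, only a sketch under a few assumptions: the names `tokenize`, `Parser` and `parse` are made up for illustration, trees are represented as plain tuples, and the exact words AND, OR and NOT are always treated as keywords.

```python
import re

# One token per match: QUOTED_TEXT, LPAREN, RPAREN or TEXT, with leading
# whitespace skipped (the grammar sends WS to the hidden channel).
TOKEN_RE = re.compile(
    r'\s*(?:(?P<QUOTED_TEXT>"[A-Za-z0-9 ,\-]+")'
    r'|(?P<LPAREN>\()|(?P<RPAREN>\))'
    r'|(?P<TEXT>[A-Za-z0-9]+))'
)
KEYWORDS = {"AND", "OR", "NOT"}  # assumption: these exact words are operators


def tokenize(text):
    """Split the input into (kind, value) pairs, ending with an EOF token."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        kind = m.lastgroup
        value = m.group(kind)
        if kind == "TEXT" and value in KEYWORDS:
            kind = value
        tokens.append((kind, value))
        pos = m.end()
    tokens.append(("EOF", ""))
    return tokens


class Parser:
    """Recursive-descent mirror of the parser rules in the grammar."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i][0]

    def take(self, kind):
        tok_kind, value = self.tokens[self.i]
        if tok_kind != kind:
            raise SyntaxError(f"expected {kind}, got {tok_kind}")
        self.i += 1
        return value

    def evaluator(self):          # evaluator : expr EOF
        tree = self.orexpr()
        self.take("EOF")
        return tree

    def orexpr(self):             # andexpr (OR^ andexpr)*
        left = self.andexpr()
        while self.peek() == "OR":
            self.take("OR")
            left = ("OR", left, self.andexpr())
        return left

    def andexpr(self):            # notexpr (AND^ notexpr)*
        left = self.notexpr()
        while self.peek() == "AND":
            self.take("AND")
            left = ("AND", left, self.notexpr())
        return left

    def notexpr(self):            # (NOT^)? atom
        if self.peek() == "NOT":
            self.take("NOT")
            return ("NOT", self.atom())
        return self.atom()

    def atom(self):               # word | LPAREN! expr RPAREN!
        if self.peek() == "LPAREN":
            self.take("LPAREN")
            tree = self.orexpr()
            self.take("RPAREN")
            return tree
        return self.word()

    def word(self):               # QUOTED_TEXT | TEXT+ -> ^(WORD TEXT+)
        if self.peek() == "QUOTED_TEXT":
            return self.take("QUOTED_TEXT")
        parts = [self.take("TEXT")]
        while self.peek() == "TEXT":
            parts.append(self.take("TEXT"))
        return ("WORD", *parts)


def parse(text):
    return Parser(tokenize(text)).evaluator()


print(parse("left right AND center"))
# ('AND', ('WORD', 'left', 'right'), ('WORD', 'center'))
```

Note that in this sketch, as in the grammar above as far as I can tell, the quoted form still comes back as a single QUOTED_TEXT token rather than a WORD node, so `"left right" AND center` and `left right AND center` give similar but not identical trees. If identical trees are required, the QUOTED_TEXT alternative would probably need a rewrite of its own.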