JavaCC: treat white space like <OR>


I'm trying to build a simple grammar for search engine queries. This is what I have so far:

options {
  STATIC=false;
  MULTI=true;
  VISITOR=true;
}

PARSER_BEGIN(SearchParser)
package com.syncplicity.searchservice.infrastructure.parser;
public class SearchParser {}
PARSER_END(SearchParser)

SKIP :
{
  " "
| "\t"
| "\n"
| "\r"
}

<*> TOKEN : {
  <#_TERM_CHAR:       ~[ " ", "\t", "\n", "\r", "!", "(", ")", "\"", "\\", "/" ] >
| <#_QUOTED_CHAR:     ~["\""] >
| <#_WHITESPACE:      ( " " | "\t" | "\n" | "\r" | "\u3000") >
}

TOKEN :
{
  <AND:              "AND">
| <OR:               "OR">
| <NOT:              ("NOT" | "!")>
| <LBRACKET:         "(">
| <RBRACKET:         ")">
| <TERM:             (<_TERM_CHAR>)+ >
| <QUOTED:           "\"" (<_QUOTED_CHAR>)+ "\"">
}

/** Main production. */

ASTQuery query() #Query: {}
{
  subQuery()
  ( <AND> subQuery() #LogicalAnd
  | <OR> subQuery() #LogicalOr
  | <NOT> subQuery() #LogicalNot
  )*
  {
    return jjtThis;
  }
}

void subQuery() #void: {}
{
  <LBRACKET> query() <RBRACKET> | term() | quoted()
}

void term() #Term:
{
  Token t;
}
{
  (
    t=<TERM>
  )
  {
    jjtThis.value = t.image;
  }
}

void quoted() #Quoted:
{
  Token t;
}
{
  (
    t=<QUOTED>
  )
  {
    jjtThis.value = t.image;
  }
}

It seems to work the way I want: it handles AND, OR, NOT/!, single terms and quoted text.

However, I can't get it to treat white space between terms as an OR operator. For example, hello world should be parsed as hello OR world.

I've tried the obvious solutions, such as <OR: ("OR" | " ")> and removing " " from SKIP, but none of them works.


Solution

  • OK. Suppose you actually do want to require that any missing OR be replaced by at least one space. To put it another way: wherever an OR would be permitted, a run of one or more white-space characters is treated as an OR.

    As in my other solution, I'll treat NOT as a unary operator and give NOT precedence over AND and AND precedence over either sort of OR.

    Change

    SKIP : { " " | "\t" | "\n" | "\r" }
    

    to

    TOKEN : {<WS : " " | "\t" | "\n" | "\r" > }
    

    Now use a grammar like this

    query() --> query0() ows() <EOF>
    query0() --> query1()
                ( LOOKAHEAD( ows() <OR> | ws() (<NOT> | <LBRACKET> | <TERM> | <QUOTED>) )
                  ows() (<OR>)?
                  query1()
                )* 
    query1() --> query2() (LOOKAHEAD(ows() <AND>) ows() <AND> query2())*
    query2() --> ows() (<NOT> query2() | subquery())
    subquery() --> <LBRACKET> query0() ows() <RBRACKET> | <TERM> | <QUOTED>
    ows() --> (<WS>)*
    ws() --> (<WS>)+
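
To see how the lookahead decisions in this grammar play out, here is a minimal hand-written recursive-descent sketch in plain Java. It is not JavaCC output and not the answerer's code: the class name WhitespaceOrSketch, the parse method and the parenthesized string it returns are assumptions made for illustration. As a simplification, a run of white space is collapsed into a single WS token, so ows() and ws() each look at no more than one token.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written sketch of the grammar above; class name and output format are hypothetical. */
final class WhitespaceOrSketch {
    private static final String WS_CHARS = " \t\n\r";
    private static final String TERM_STOP = WS_CHARS + "!()\"\\/"; // complement of _TERM_CHAR
    private final List<String> toks = new ArrayList<>();
    private int pos = 0;

    private WhitespaceOrSketch(String in) {
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (WS_CHARS.indexOf(c) >= 0) {        // simplification: a run of <WS> becomes one token
                while (i < in.length() && WS_CHARS.indexOf(in.charAt(i)) >= 0) i++;
                toks.add(" ");
            } else if (c == '(' || c == ')') {
                toks.add(String.valueOf(c));
                i++;
            } else if (c == '!') {                 // "!" is an alias for NOT
                toks.add("NOT");
                i++;
            } else if (c == '"') {                 // QUOTED keeps its quotes, so it never equals a keyword
                int end = in.indexOf('"', i + 1);
                if (end <= i + 1) throw new IllegalArgumentException("bad quoted text at " + i);
                toks.add(in.substring(i, end + 1));
                i = end + 1;
            } else {                               // TERM; the images AND, OR, NOT act as keywords
                int start = i;
                while (i < in.length() && TERM_STOP.indexOf(in.charAt(i)) < 0) i++;
                if (i == start) throw new IllegalArgumentException("unexpected character at " + i);
                toks.add(in.substring(start, i));
            }
        }
    }

    private String peek(int p) { return p < toks.size() ? toks.get(p) : ""; } // "" stands for <EOF>
    private boolean isWs(int p) { return peek(p).equals(" "); }
    private int afterWs() { return isWs(pos) ? pos + 1 : pos; }               // position after ows()
    private void ows() { pos = afterWs(); }
    private boolean startsOperand(String t) {                                 // NOT, LBRACKET, TERM or QUOTED
        return !(t.isEmpty() || t.equals(" ") || t.equals("AND") || t.equals("OR") || t.equals(")"));
    }
    private void expect(String t) {
        if (!peek(pos).equals(t)) throw new IllegalArgumentException("expected '" + t + "' at token " + pos);
        pos++;
    }

    private String query0() {                      // lowest precedence: explicit or implied OR
        String left = query1();
        while (true) {
            int p = afterWs();
            boolean explicitOr = peek(p).equals("OR");                   // ows() <OR>
            boolean impliedOr = isWs(pos) && startsOperand(peek(p));     // ws() followed by an operand
            if (!explicitOr && !impliedOr) return left;
            pos = explicitOr ? p + 1 : p;                                // consume ows() (<OR>)?
            left = "(" + left + " OR " + query1() + ")";
        }
    }

    private String query1() {                      // AND binds tighter than OR
        String left = query2();
        while (peek(afterWs()).equals("AND")) {
            pos = afterWs() + 1;                                         // consume ows() <AND>
            left = "(" + left + " AND " + query2() + ")";
        }
        return left;
    }

    private String query2() {                      // NOT is a unary prefix operator
        ows();
        if (peek(pos).equals("NOT")) { pos++; return "(NOT " + query2() + ")"; }
        return subquery();
    }

    private String subquery() {
        String t = peek(pos);
        if (t.equals("(")) {
            pos++;
            String inner = query0();
            ows();
            expect(")");
            return inner;
        }
        if (!startsOperand(t)) throw new IllegalArgumentException("operand expected at token " + pos);
        pos++;
        return t;
    }

    /** query() --> query0() ows() <EOF> */
    static String parse(String input) {
        WhitespaceOrSketch p = new WhitespaceOrSketch(input);
        String tree = p.query0();
        p.ows();
        p.expect("");
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parse("hello world"));  // prints (hello OR world)
    }
}
```

With this sketch, a b AND c comes out as (a OR (b AND c)), which shows AND taking precedence over the implied OR, while a(b) is rejected because no white space separates the two operands.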