This is an extension of this question. I have written the pyparsing
code as one-to-one translation of grammar.
My DSL :
response:success
response:success AND extension:php OR extension:css
response:sucess AND (extension:php OR extension:css)
time >= 2020-01-09
time >= 2020-01-09 AND response:success OR os:windows
NOT reponse:success
response:success AND NOT os:windows
EBNF Grammar fro above DSL:
<expr> ::= <or>
<or> ::= <and> (" OR " <and>)*
<and> ::= <unary> ((" AND ") <unary>)*
<unary> ::= " NOT " <unary> | <equality>
<equality> ::= (<word> ":" <word>) | <comparison>
<comparison> ::= "(" <expr> ")" | (<word> (" > " | " >= " | " < " | " <= ") <word>)+
<word> ::= ("a" | "b" | "c" | "d" | "e" | "f" | "g"
| "h" | "i" | "j" | "k" | "l" | "m" | "n"
| "o" | "p" | "q" | "r" | "s" | "t" | "u"
| "v" | "w" | "x" | "y" | "z")+
Now I can get a list of tokens. the next step is to generate some sort of ast/structure so that I can generate the code from each node type?
After reading some examples on pyparsing
i think I got a vague idea of how to approach this:
1) I can use Group()
to group together relevant structure important for code generation and each grouping will probably represent a node in ast.
2) Along with Group()
i can use setParseAction()
to directly encode the python representation of my node object during the parsing phase itself instead of generating a structure first.
My approach in code:
AND = Keyword('AND')
OR = Keyword('OR')
NOT = Keyword('NOT')
word = Word(alphanums+'_')
expr = Forward()
Comparison = Literal('(') + expr + Literal(')') + OneOrMore(word + ( Literal('>') | Literal('>=') | Literal('<') | Literal('<=')) + word)
Equality = Group((word('searchKey') + Literal(':') + word('searchValue')) | Comparison)
Unary = Forward()
unaryNot = NOT + Unary
Unary << (unaryNot | Equality)
And = Group(Unary + ZeroOrMore(AND + Unary))
Or = And + ZeroOrMore(OR + And)
expr << Or
class AndNode:
def __init__(self, tokens):
self.tokens = tokens.asList()
def query(self):
pass #generate the relevant elastic search query here?
class ExactMatchNode:
def __init__(self, tokens):
self.tokens = tokens
def __repr__(self):
return "<ExactMatchNode>"
def query(self):
pass #generate the relevant elasticsearch query here?
Equality.setParseAction(ExactMatchNode)
Q1 = '''response:200 AND time:22 AND rex:32 OR NOT demo:good'''
result = expr.parseString(Q1)
print(result.dump())
This is my output:
[[<ExactMatchNode>, 'AND', <ExactMatchNode>, 'AND', <ExactMatchNode>], 'OR', ['NOT', <ExactMatchNode>]]
[0]:
[<ExactMatchNode>, 'AND', <ExactMatchNode>, 'AND', <ExactMatchNode>]
[1]:
OR
[2]:
['NOT', <ExactMatchNode>]
I am lost at this point as in how does this represents a tree structure? ex.
[<ExactMatchNode>, 'AND', <ExactMatchNode>, 'AND', <ExactMatchNode>]
should be like this instead?
[AND [<ExactMatchNode>, <ExactMatchNode>, <ExactMatchNode>]]
I guess this could be done in setParseAction
but I am not sure is this the right direction? or i should start modifying my grammar at this point. End goal of this DSL is to translate the given query to elasticsearch
json querying language.
EDIT: After trying out some things this is what I have:
class NotNode:
def __init__(self, tokens):
self.negatearg = tokens
#print(f'**** \n {self.negatearg} \n +++')
def __repr__(self):
return f'( NOT-> {self.negatearg} )'
class AndNode:
def __init__(self, tokens):
self.conds = tokens[0][::2]
#print(f'**** \n {tokens} \n +++')
def __repr__(self):
return f'( AND-> {self.conds} )'
def generate_query(self):
result = [cond.generate_query() for cond in self.conds]
return result
class ExactMatchNode:
def __init__(self, tokens):
self.tokens = tokens[0]
#print(f'**** \n {tokens} \n +++')
def __repr__(self):
return f"<ExactMatchNode {self.tokens.searchKey}={self.tokens.searchValue}>"
def generate_query(self):
return {
'term' : { self.tokens[0]: self.tokens[2]}
}
unaryNot.setParseAction(NotNode)
Equality.setParseAction(ExactMatchNode)
And.setParseAction(AndNode)
I can now use <some node object>.generate_query()
to get a query .
but one wierd thing i notice in my output below is:
[( AND-> [<ExactMatchNode response=200>, <ExactMatchNode time=22>, <ExactMatchNode rex=32>] ), 'OR', ( AND-> [( NOT-> ['NOT', <ExactMatchNode demo=good>] )] )]
second AND->
appended before NOT
node.
My question is still the same, is this even the correct way of using pyparsing or I missed something obvious and going in the wrong direction?
Attaching node classes using setParseAction is the best way I have found to build an AST from a hierchical grammar. You will probably not need the Group constructs if you use this method. The reason you are getting the second And is because your parser always produces an AndNode, even if there is only one operand with no additional AND operand
.
You can expand your And expression to only attach the AndNode parse action class if there is an operand AND operand
(and likewise for NOT and OR), something like:
And = (Unary + OneOrMore(AND + Unary)).addParseAction(AndNode) | Unary
Or = (And + OneOrMore(OR + And)).addParseAction(OrNode) | And
This is how pyparsing's infixNotation handles these kinds of operators.
My version of your parser, using infixNotation (I think the classes are almost all the same, maybe I tweaked the NotNode definition):
"""
<expr> ::= <or>
<or> ::= <and> (" OR " <and>)*
<and> ::= <unary> ((" AND ") <unary>)*
<unary> ::= " NOT " <unary> | <equality>
<equality> ::= (<word> ":" <word>) | <comparison>
<comparison> ::= "(" <expr> ")" | (<word> (" > " | " >= " | " < " | " <= ") <word>)+
<word> ::= ("a" | "b" | "c" | "d" | "e" | "f" | "g"
| "h" | "i" | "j" | "k" | "l" | "m" | "n"
| "o" | "p" | "q" | "r" | "s" | "t" | "u"
| "v" | "w" | "x" | "y" | "z")+
"""
import pyparsing as pp
NOT, AND, OR = map(pp.Keyword, "NOT AND OR".split())
word = ~(NOT | AND | OR) + pp.Word(pp.alphas.lower() + '-_')
date = pp.Regex(r"\d{4}-\d{2}-\d{2}")
operand = word | date
class ExactMatchNode:
def __init__(self, tokens):
self.tokens = tokens
def __repr__(self):
return "<ExactMatchNode>"
def query(self):
pass #generate the relevant elasticsearch query here?
class ComparisonNode:
def __init__(self, tokens):
self.tokens = tokens
def __repr__(self):
return "<ComparisonNode>"
def query(self):
pass #generate the relevant elasticsearch query here?
class NotNode:
def __init__(self, tokens):
self.negatearg = tokens[0][1]
#print(f'**** \n {self.negatearg} \n +++')
def __repr__(self):
return f'( NOT-> {self.negatearg} )'
class AndNode:
def __init__(self, tokens):
self.conds = tokens[0][::2]
#print(f'**** \n {tokens} \n +++')
def __repr__(self):
return f'( AND-> {self.conds} )'
def generate_query(self):
result = [cond.generate_query() for cond in self.conds]
return result
class OrNode:
def __init__(self, tokens):
self.conds = tokens[0][::2]
#print(f'**** \n {tokens} \n +++')
def __repr__(self):
return f'( OR-> {self.conds} )'
def generate_query(self):
result = [cond.generate_query() for cond in self.conds]
return result
expr = pp.infixNotation(operand,
[
(':', 2, pp.opAssoc.LEFT, ExactMatchNode),
(pp.oneOf("> >= < <="), 2, pp.opAssoc.LEFT, ComparisonNode),
(NOT, 1, pp.opAssoc.RIGHT, NotNode),
(AND, 2, pp.opAssoc.LEFT, AndNode),
(OR, 2, pp.opAssoc.LEFT, OrNode),
])
expr.runTests("""\
response:success
response:success AND extension:php OR extension:css
response:sucess AND (extension:php OR extension:css)
time >= 2020-01-09
time >= 2020-01-09 AND response:success OR os:windows
NOT reponse:success
response:success AND NOT os:windows
""")
prints
response:success
[<ExactMatchNode>]
response:success AND extension:php OR extension:css
[( OR-> [( AND-> [<ExactMatchNode>, <ExactMatchNode>] ), <ExactMatchNode>] )]
response:sucess AND (extension:php OR extension:css)
[( AND-> [<ExactMatchNode>, ( OR-> [<ExactMatchNode>, <ExactMatchNode>] )] )]
time >= 2020-01-09
[<ComparisonNode>]
time >= 2020-01-09 AND response:success OR os:windows
[( OR-> [( AND-> [<ComparisonNode>, <ExactMatchNode>] ), <ExactMatchNode>] )]
NOT reponse:success
[( NOT-> <ExactMatchNode> )]
response:success AND NOT os:windows
[( AND-> [<ExactMatchNode>, ( NOT-> <ExactMatchNode> )] )]