I have a bunch of sentences which I need to parse and convert to corresponding regex search code. Examples of my sentences -
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
- This means that in the line, phrase one comes somewhere before phrase2 and phrase3. Also, the line must start with Therefore we
LINE_CONTAINS abc {upto 4 words} xyz {upto 3 words} pqr
- This means I need to allow up to 4 words between the first 2 phrases and up to 3 words between the last 2 phrases
With help from Paul McGuire (here), the following grammar was written:
from pyparsing import (CaselessKeyword, Word, alphanums, nums, MatchFirst, quotedString,
    infixNotation, Combine, opAssoc, Suppress, pyparsing_common, Group, OneOrMore, ZeroOrMore)

# LINE_ENDSWITH is used below, so it has to be defined here as well
LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(CaselessKeyword,
    """LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split())
NOT, AND, OR = map(CaselessKeyword, "NOT AND OR".split())
BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())

lpar = Suppress('{')
rpar = Suppress('}')
keyword = MatchFirst([LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH, NOT, AND, OR,
    BEFORE, AFTER, JOIN])  # declaring all keywords and assigning order for all further use

phrase_word = ~keyword + (Word(alphanums + '_'))
upto_N_words = Group(lpar + 'upto' + pyparsing_common.integer('numberofwords') + 'words' + rpar)
phrase_term = Group(OneOrMore(phrase_word) + ZeroOrMore((upto_N_words) + OneOrMore(phrase_word)))

phrase_expr = infixNotation(phrase_term,
    [
        ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,),  # (opExpr, numTerms, rightLeftAssoc, parseAction)
        (NOT, 1, opAssoc.RIGHT,),
        (AND, 2, opAssoc.LEFT,),
        (OR, 2, opAssoc.LEFT),
    ],
    lpar=Suppress('{'), rpar=Suppress('}')
)  # structure of a single phrase with its operators

line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") +
    Group(phrase_expr)("phrase"))  # basically giving structure to a single sub-rule having line-term and phrase

line_contents_expr = infixNotation(line_term,
    [
        (NOT, 1, opAssoc.RIGHT,),
        (AND, 2, opAssoc.LEFT,),
        (OR, 2, opAssoc.LEFT),
    ]
)  # grammar for the entire rule/sentence

sample1 = """
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
"""
sample2 = """
LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else
"""
My question now is: how do I access the parsed elements in order to convert the sentences to my regex code? For this, I tried the following:
import pprint

parsed = line_contents_expr.parseString(sample1)  # run again with sample2 for the second result
print(parsed[0].asDict())
print(parsed)
pprint.pprint(parsed)
The result of the above code for sample1 was:
{}
[[['LINE_CONTAINS', [[['sentence', 'one'], 'BEFORE', [['sentence2'], 'AND', ['sentence3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]
([([(['LINE_CONTAINS', ([([(['sentence', 'one'], {}), 'BEFORE', ([(['sentence2'], {}), 'AND', (['sentence3'], {})], {})], {})], {})], {'phrase': [(([([(['sentence', 'one'], {}), 'BEFORE', ([(['sentence2'], {}), 'AND', (['sentence3'], {})], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]}), 'AND', (['LINE_STARTSWITH', ([(['Therefore', 'we'], {})], {})], {'phrase': [(([(['Therefore', 'we'], {})], {}), 1)], 'line_directive': [('LINE_STARTSWITH', 0)]})], {})], {})
The result of the above code for sample2 was:
{'phrase': [[['abcd', {'numberofwords': 4}, 'xyzw', {'numberofwords': 3}, 'pqrs'], 'BEFORE', ['something', 'else']]], 'line_directive': 'LINE_CONTAINS'}
[['LINE_CONTAINS', [[['abcd', ['upto', 4, 'words'], 'xyzw', ['upto', 3, 'words'], 'pqrs'], 'BEFORE', ['something', 'else']]]]]
([(['LINE_CONTAINS', ([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {})], {'phrase': [(([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]})], {})
My questions based on the above output are:
- Why does the asDict() method give no output for sample1 but does for sample2?
- When I try print(parsed.numberofwords) or parsed.line_directive or parsed.line_term, it gives me nothing. How can I access these elements in order to use them to build my regex code?

To answer your printing questions:
1) pprint is there to pretty-print a nested list of tokens, without showing any results names. It is essentially a wrapper around calling pprint.pprint(results.asList()).
2) asDict() is there to convert your parsed results to an actual Python dict, so it only shows the results names (with nesting if you have names within names). For sample1, parsed[0] is the group created by the top-level AND, which carries no results names of its own (they live on the nested line terms), so asDict() returns an empty dict. For sample2, parsed[0] is the line term itself, so its names show up.
To view the contents of your parsed output, you are best off using print(result.dump()). dump() shows both the nesting of the results and any named results along the way.
result = line_contents_expr.parseString(sample2)
print(result.dump())
I also recommend using expr.runTests to give you dump() output as well as any exceptions and exception locators. With your code, you could most easily do this using:
line_contents_expr.runTests([sample1, sample2])
But I also suggest you step back a second and think about just what this {upto n words} business is all about. Look at your samples and draw rectangles around the line terms, and then within the line terms draw circles around the phrase terms. (This would be a good exercise in leading up to writing for yourself a BNF description of this grammar, which I always recommend as a getting-your-head-around-the-problem step.) What if you treated the upto expressions as another operator? To see this, change phrase_term back to the way you had it:
phrase_term = Group(OneOrMore(phrase_word))
And then change your first precedence entry in defining a phrase expression to:
((BEFORE | AFTER | JOIN | upto_N_words), 2, opAssoc.LEFT,),
Or give some thought to maybe having the upto operator at a higher or lower precedence than BEFORE, AFTER, and JOIN, and adjust the precedence list accordingly.
With this change, I get this output from calling runTests on your samples:
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
[[['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]
[0]:
  [['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]
  [0]:
    ['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]]
    - line_directive: 'LINE_CONTAINS'
    - phrase: [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]
      [0]:
        [['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]
        [0]:
          ['phrase', 'one']
        [1]:
          BEFORE
        [2]:
          [['phrase2'], 'AND', ['phrase3']]
          [0]:
            ['phrase2']
          [1]:
            AND
          [2]:
            ['phrase3']
  [1]:
    AND
  [2]:
    ['LINE_STARTSWITH', [['Therefore', 'we']]]
    - line_directive: 'LINE_STARTSWITH'
    - phrase: [['Therefore', 'we']]
      [0]:
        ['Therefore', 'we']

LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else
[['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]]
[0]:
  ['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]
  - line_directive: 'LINE_CONTAINS'
  - phrase: [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]
    [0]:
      [['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]
      [0]:
        ['abcd']
      [1]:
        ['upto', 4, 'words']
        - numberofwords: 4
      [2]:
        ['xyzw']
      [3]:
        ['upto', 3, 'words']
        - numberofwords: 3
      [4]:
        ['pqrs']
      [5]:
        BEFORE
      [6]:
        ['something', 'else']
You can iterate over these results and pick them apart, but you are rapidly reaching the point where you should look at building executable nodes from the different precedence levels - see the SimpleBool.py example on the pyparsing wiki for how to do this.
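As a rough illustration of the "iterate and pick apart" route, here is a minimal sketch that walks the nested list shown above for sample1 (the kind of structure asList() returns) and pulls out each line directive with its phrase. The names DIRECTIVES, is_line_term and line_terms are hypothetical helpers made up for this sketch, not part of pyparsing, and it assumes every line term has the two-element shape [directive, phrase] seen in that output.

```python
# Nested list copied from the sample1 output above (what asList() would return).
sample1_tokens = [[['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE',
                                       [['phrase2'], 'AND', ['phrase3']]]]],
                   'AND',
                   ['LINE_STARTSWITH', [['Therefore', 'we']]]]]

# Hypothetical helper names; the directive set mirrors the grammar's keywords.
DIRECTIVES = {"LINE_CONTAINS", "LINE_STARTSWITH", "LINE_ENDSWITH"}


def is_line_term(node):
    """A line term is assumed to look like [directive, phrase]."""
    return (isinstance(node, list) and len(node) == 2
            and isinstance(node[0], str) and node[0] in DIRECTIVES)


def line_terms(node):
    """Yield (directive, phrase) for every line term in the nested list."""
    if is_line_term(node):
        yield node[0], node[1]
    elif isinstance(node, list):
        for child in node:
            yield from line_terms(child)


found = list(line_terms(sample1_tokens))
for directive, phrase in found:
    print(directive, phrase)
```

This works for a quick look at the structure, but every new operator means another special case in the walker, which is why attaching node classes to the precedence levels tends to scale better.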
EDIT: Please review this pared-down version of a parser for phrase_expr, and how it creates Node instances that themselves generate the output. See how numberofwords is accessed on the operator in the UpToNode class. See how "xyz abc" gets interpreted as "xyz AND abc" with an implicit AND operator.
from pyparsing import *
import re

UPTO, WORDS, AND, OR = map(CaselessKeyword, "upto words and or".split())
keyword = UPTO | WORDS | AND | OR
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()
word = ~keyword + Word(alphas)
upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE)

class Node(object):
    def __init__(self, tokens):
        self.tokens = tokens

    def generate(self):
        pass

class LiteralNode(Node):
    def generate(self):
        return "(%s)" % re.escape(self.tokens[0])

    def __repr__(self):
        return repr(self.tokens[0])

class AndNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '.*'.join(t.generate() for t in tokens[::2])

    def __repr__(self):
        return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2])

class OrNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '|'.join(t.generate() for t in tokens[::2])

    def __repr__(self):
        return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2])

class UpToNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        ret = tokens[0].generate()
        word_re = r"\s+\S+"
        space_re = r"\s+"
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate()
        return ret

    def __repr__(self):
        tokens = self.tokens[0]
        ret = repr(tokens[0])
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand)
        return ret

IMPLICIT_AND = Empty().setParseAction(replaceWith("AND"))

phrase_expr = infixNotation(word.setParseAction(LiteralNode),
    [
        (upto_expr, 2, opAssoc.LEFT, UpToNode),
        (AND | IMPLICIT_AND, 2, opAssoc.LEFT, AndNode),
        (OR, 2, opAssoc.LEFT, OrNode),
    ])

tests = """\
xyz
xyz abc
xyz {upto 4 words} def""".splitlines()

for t in tests:
    t = t.strip()
    if not t:
        continue
    print(t)
    try:
        parsed = phrase_expr.parseString(t)
    except ParseException as pe:
        print(' '*pe.loc + '^')
        print(pe)
        continue
    print(parsed)
    print(parsed[0].generate())
    print()

prints:
xyz
['xyz']
(xyz)

xyz abc
['xyz' AND 'abc']
(xyz).*(abc)

xyz {upto 4 words} def
['xyz' {0-4 WORDS} 'def']
(xyz)((\s+\S+){0,4}\s+)(def)

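The generated patterns can be sanity-checked with the standard re module before building anything larger on top of them. The sketch below copies two of the patterns from the output above and tries them on a few made-up lines; hits is a hypothetical helper and the test lines are invented for illustration.

```python
import re

# Patterns copied verbatim from the generated output above.
upto_pattern = re.compile(r"(xyz)((\s+\S+){0,4}\s+)(def)")
and_pattern = re.compile(r"(xyz).*(abc)")


def hits(pattern, line):
    """Return True if the pattern is found anywhere in the line."""
    return pattern.search(line) is not None


# Zero to four intervening words are accepted, five are not.
zero_words = hits(upto_pattern, "xyz def")
two_words = hits(upto_pattern, "xyz one two def")
five_words = hits(upto_pattern, "xyz a b c d e def")

# No word boundaries are enforced, so a longer word starting with "def" matches too.
prefix_only = hits(upto_pattern, "xyz definitely")

# The AND pattern is order-sensitive: xyz has to come before abc.
in_order = hits(and_pattern, "xyz then abc")
reversed_order = hits(and_pattern, "abc then xyz")

print(zero_words, two_words, five_words, prefix_only, in_order, reversed_order)
```

Two things stand out from this check: the patterns match partial words unless something like \b is added around each literal, and the '.*' join in AndNode means "xyz AND abc" only matches when xyz comes first. Whether either matters depends on what AND is meant to express in your rules.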
Expand on this to support your LINE_xxx expressions.
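One possible direction for the LINE_xxx step, sketched with plain re and no parser involved: wrap the regex produced for a phrase according to its directive, then AND the line terms together by testing each one separately. line_directive_regex and line_matches are hypothetical helpers, and the mapping of directives onto ^ and $ anchors is an assumption about what the directives should mean, not something the grammar defines.

```python
import re


def line_directive_regex(directive, phrase_regex):
    """Wrap a phrase regex according to a LINE_xxx directive (hypothetical helper)."""
    # The non-capturing group keeps a top-level '|' inside the phrase
    # from escaping the anchor.
    grouped = "(?:%s)" % phrase_regex
    if directive == "LINE_STARTSWITH":
        return "^" + grouped
    if directive == "LINE_ENDSWITH":
        return grouped + "$"
    if directive == "LINE_CONTAINS":
        return grouped
    raise ValueError("unknown directive: %r" % directive)


def line_matches(line, terms):
    """AND several (directive, phrase_regex) pairs by testing each in turn."""
    return all(re.search(line_directive_regex(d, p), line) for d, p in terms)


# Phrase regexes written by hand here; in practice they would come from generate().
terms = [
    ("LINE_CONTAINS", r"(phrase)\s+(one)"),
    ("LINE_STARTSWITH", r"(Therefore)\s+(we)"),
]

ok = line_matches("Therefore we see phrase one here", terms)
bad = line_matches("We see phrase one; Therefore we stop", terms)
ends = re.search(line_directive_regex("LINE_ENDSWITH", r"(stop)"), "then we stop") is not None
print(ok, bad, ends)
```

Testing each line term separately sidesteps having to merge anchored and unanchored patterns into a single regex; in the node-based design the same logic could live in a node class for the line-level AND.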