First I would like to say that I am just starting to use pyparsing, and I need some help with the following. So here is the context:
I have a file with text lines. Each line can have one sequence or can be a set of sequences combined with the concurrent operator ||. The representation can either be (seq1)||(seq2)||... etc or simply seq1.
A sequence seq_ is a set of events, starting with a question followed by one or more answers. The sequence order is defined by the order of the answers on the line, with the operator -> or => in between. The interpretation is that the system reading the line should execute the question and then check that the answers are identical to the ones defined in the line, in the defined order, hence the sequence. A line where several sequences run concurrently simply means that the question at the beginning of each sequence should be executed first, and only then will the system check the answers for each question independently (concurrent execution and check).
The question format is as such: qtn(elm,sub,num1[,num2,num3,...]) where what is between [] is optional and where elm and sub are names and num_ are numbers.
The answer is more complicated and can be one of the following:
- ans(elma,acta,suba,num5[,num6,num7,...][elma.pr1=num8[,elma.pr2=num9]])[<timeout], meaning that some num_ are optional and the timeout too.
- ans(elma,acta,suba,num5[,num6,num7,...])[<timeout] | prm(elma.pr1=num8[,elma.pr2=num9]) where the | operator indicates an OR, meaning that one answer OR the other is enough to consider that the global answer is correct.
- ans(elma,acta,suba,num5[,num6,num7,...])[<timeout] & prm(elma.pr1=num) where the & operator indicates an AND, meaning that both answers are required to consider that the global answer is correct.
- | and & operators mixed in with (). For example we could have ans(elma,acta,suba,num5[,num6,num7,...])[<timeout] | (prm(elma.pr1=num8) & prm(elmb.pr2=num9)) or something more complex with or without (). There is no operator priority, only the () indicate some order.
So my idea is to define the general syntax of the different big elements (and I am not sure it is the best way):
qtn = Regex(r'qtn\([a-z0-9]+,[a-z]+(,[ex0-9_.]+)+\)')
ans = Combine(Regex(r'ans\([a-z0-9]+,[a-z]+(,[a-z0-9_.]+)+\)') + Regex('(<[0-9]+)*'))
Perhaps it would be better to define separately what a num is, what a timeout is, what an id such as elma is, and compose the answers from those definitions. After having each element of a seq in a list and having the list of all sequences in a line, I am planning to interpret each element in a later part of the code.
Where I am stuck now is on how to define the general answer syntax, which is complex, in such a way that the output of parsing can be evaluated according to the () and the & and | operators. I am trying to understand the fourFn.py pyparsing example, but so far I am clueless.
Any help you could give me is welcome.
Your Regex and sample strings were good inputs for defining a simple parser. This is usually done a little more formally as a BNF, but these were sufficient. Here is a basic implementation of your simple ans format; you should be able to generalize from here to what the question would look like:
import pyparsing as pp

# punctuation is needed for parsing but is suppressed from the results
LPAR, RPAR, COMMA, LT = map(pp.Suppress, "(),<")

# basic building blocks
element = pp.Word(pp.alphas.lower(), pp.alphanums.lower())
action = pp.Word(pp.alphas.lower())
subject = pp.Word(pp.alphas.lower())
number = pp.pyparsing_common.number()
timeout_expr = LT + number("timeout")

# put the pieces together into a larger expression
ans_expr = pp.Group(
    pp.Literal('ans')
    + LPAR
    + element('element')
    + COMMA
    + action('action')
    + COMMA
    + subject('subject')
    + COMMA
    + number('num*')
    + (COMMA + number('num*'))[...]
    + RPAR
    + pp.Optional(timeout_expr)
)

# use runTests to try it out, will also flag parse errors
ans_expr.runTests("""
ans(first, act, sub, 1000)
ans(first, act, sub, 1000, 2000)
ans(first, act, sub, 1000, 2000) < 50
# example of a parsing error
ans(first, act1, sub, 1000)
""")
Will print:
ans(first, act, sub, 1000)
[['ans', 'first', 'act', 'sub', 1000]]
[0]:
  ['ans', 'first', 'act', 'sub', 1000]
  - action: 'act'
  - element: 'first'
  - num: [1000]
  - subject: 'sub'

ans(first, act, sub, 1000, 2000)
[['ans', 'first', 'act', 'sub', 1000, 2000]]
[0]:
  ['ans', 'first', 'act', 'sub', 1000, 2000]
  - action: 'act'
  - element: 'first'
  - num: [1000, 2000]
  - subject: 'sub'

ans(first, act, sub, 1000, 2000) < 50
[['ans', 'first', 'act', 'sub', 1000, 2000, 50]]
[0]:
  ['ans', 'first', 'act', 'sub', 1000, 2000, 50]
  - action: 'act'
  - element: 'first'
  - num: [1000, 2000]
  - subject: 'sub'
  - timeout: 50

# example of a parsing error
ans(first, act1, sub, 1000)
              ^
FAIL: Expected ',', found '1' (at char 14), (line:1, col:15)
Note the use of results names to help you access the results by name, which will make your parser easier to maintain and use.
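For the answers combined with &, | and parentheses that the question asks about, one possible approach is pyparsing's infixNotation helper. What follows is only a sketch under a few assumptions: the prm(...) layout is guessed from the samples in the question, both operators sit on a single precedence level so that only the () impose an order, and the names kind, nums, props, evaluate and check are made up for this illustration.

```python
import pyparsing as pp

LPAR, RPAR, COMMA, LT, EQ, DOT = map(pp.Suppress, "(),<=.")
name = pp.Word(pp.alphas.lower(), pp.alphanums.lower())
word = pp.Word(pp.alphas.lower())
number = pp.pyparsing_common.number()

# ans(elm, act, sub, num[, num...]) with an optional "< timeout"
ans_expr = pp.Group(
    pp.Literal("ans")("kind")
    + LPAR
    + name("element") + COMMA
    + word("action") + COMMA
    + word("subject") + COMMA
    + pp.Group(number + (COMMA + number)[...])("nums")
    + RPAR
    + pp.Optional(LT + number("timeout"))
)

# prm(elm.prop=num[, elm.prop=num...]); layout assumed from the question's samples
prop = pp.Group(name("element") + DOT + name("prop") + EQ + number("value"))
prm_expr = pp.Group(
    pp.Literal("prm")("kind")
    + LPAR
    + pp.Group(prop + (COMMA + prop)[...])("props")
    + RPAR
)

# one precedence level for both operators: no priority, only () group things
operand = ans_expr | prm_expr
answer = pp.infixNotation(operand, [(pp.oneOf("& |"), 2, pp.opAssoc.LEFT)])


def evaluate(node, check):
    """Fold a parsed answer to a bool; check(leaf) is a hypothetical callback
    that decides whether one ans/prm leaf was satisfied."""
    if "kind" in node:  # a single ans(...) or prm(...) group
        return check(node)
    # otherwise: [operand, op, operand, op, operand, ...], folded left to right
    result = evaluate(node[0], check)
    for op, rhs in zip(node[1::2], node[2::2]):
        value = evaluate(rhs, check)
        result = (result and value) if op == "&" else (result or value)
    return result


text = "ans(first, act, sub, 1000) < 50 | (prm(first.pr1=8) & prm(second.pr2=9))"
tree = answer.parseString(text, parseAll=True)[0]
print(tree.dump())
print(evaluate(tree, lambda leaf: leaf["kind"] == "ans"))
```

A single operand comes back as a plain ans or prm group, while a combination comes back as a list of operand, operator, operand and so on, with a parenthesized part nested as its own list. That nesting is what evaluate walks, so the () in the input end up deciding the evaluation order.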