How to parse groups with operator and brackets


I am just starting out with pyparsing and need some help with the following. Here is the context:

I have a file with text lines. Each line holds either a single sequence or a set of sequences combined with the concurrency operator ||. A line is therefore written either as (seq1)||(seq2)||... or simply as seq1.

A sequence seq_ is a set of events: a question followed by one or more answers. The order of the sequence is the order of the answers on the line, which are separated by the operator -> or =>. The system reading the line should execute the question and then check that the answers it gets are identical to the ones defined in the line, in the defined order (hence "sequence"). When a line holds several concurrent sequences, the question at the beginning of each sequence is executed first, and only then does the system check the answers for each question independently (concurrent execution and check).

The question format is qtn(elm,sub,num1[,num2,num3,...]), where the part between [] is optional, elm and sub are names, and the num_ values are numbers.

The answer is more complicated and can be one of the following:

  • ans(elma,acta,suba,num5[,num6,num7,...][elma.pr1=num8[,elma.pr2=num9]])[<timeout], meaning that some of the num_ values are optional, and so is the timeout.
  • ans(elma,acta,suba,num5[,num6,num7,...])[<timeout] | prm(elma.pr1=num8[,elma.pr2=num9]), where the | operator indicates an OR, meaning that one answer or the other is enough for the global answer to be considered correct.
  • ans(elma,acta,suba,num5[,num6,num7,...])[<timeout] & prm(elma.pr1=num), where the & operator indicates an AND, meaning that both answers are required for the global answer to be considered correct.
  • Any combination of elements with the | and & operators, mixed with (). For example, we could have ans(elma,acta,suba,num5[,num6,num7,...])[<timeout] | (prm(elma.pr1=num8) & prm(elmb.pr2=num9)), or something more complex, with or without (). There is no operator priority; only the () indicate an order.

So my idea is to define the general syntax of the different big elements (and I am not sure it is the best way):

  • the question would be:
qtn = Regex(r'qtn\([a-z0-9]+,[a-z]+(,[ex0-9_.]+)+\)')
  • one of the simple answers would be:
ans = Combine(Regex(r'ans\([a-z0-9]+,[a-z]+(,[a-z0-9_.]+)+\)') + Regex('(<[0-9]+)*'))

Perhaps it would be better to define separately what a num is, what a timeout is, what an id such as elma is, and compose the answers from those definitions. After having each element of a seq in a list and having the list of all sequences in a line, I am planning to interpret each element in a later part of the code.

Where I am stuck now is how to define the general syntax of the answer, which is complex, in such a way that the parse output can be evaluated according to the () and the & and | operators. I am trying to understand the fourFn.py pyparsing example, but so far I am clueless.

Any help you could give me is welcome.


Solution

  • Your Regex and sample strings were good inputs for defining a simple parser. This is usually done a little more formally as a BNF, but these were sufficient. Here is a basic implementation of your simple ans format; you should be able to generalize from it to what the question would look like:

    import pyparsing as pp
    
    LPAR, RPAR, COMMA, LT = map(pp.Suppress, "(),<")
    
    element = pp.Word(pp.alphas.lower(), pp.alphanums.lower())
    action = pp.Word(pp.alphas.lower())
    subject = pp.Word(pp.alphas.lower())
    number = pp.pyparsing_common.number()
    timeout_expr = LT + number("timeout")
    
    # put the pieces together into a larger expression
    ans_expr = pp.Group(pp.Literal('ans')
                        + LPAR
                        + element('element')
                        + COMMA
                        + action('action')
                        + COMMA
                        + subject('subject')
                        + COMMA
                        + number('num*')
                        + (COMMA + number('num*'))[...]
                        + RPAR
                        + pp.Optional(timeout_expr)
                        )
    
    # use runTests to try it out, will also flag parse errors
    ans_expr.runTests("""
        ans(first, act, sub, 1000)
        ans(first, act, sub, 1000, 2000)
        ans(first, act, sub, 1000, 2000) < 50
    
        # example of a parsing error
        ans(first, act1, sub, 1000)
        """)
    

    This will print:

    ans(first, act, sub, 1000)
    [['ans', 'first', 'act', 'sub', 1000]]
    [0]:
      ['ans', 'first', 'act', 'sub', 1000]
      - action: 'act'
      - element: 'first'
      - num: [1000]
      - subject: 'sub'
    
    ans(first, act, sub, 1000, 2000)
    [['ans', 'first', 'act', 'sub', 1000, 2000]]
    [0]:
      ['ans', 'first', 'act', 'sub', 1000, 2000]
      - action: 'act'
      - element: 'first'
      - num: [1000, 2000]
      - subject: 'sub'
    
    ans(first, act, sub, 1000, 2000) < 50
    [['ans', 'first', 'act', 'sub', 1000, 2000, 50]]
    [0]:
      ['ans', 'first', 'act', 'sub', 1000, 2000, 50]
      - action: 'act'
      - element: 'first'
      - num: [1000, 2000]
      - subject: 'sub'
      - timeout: 50
    
    # example of a parsing error
    ans(first, act1, sub, 1000)
                  ^
    FAIL: Expected ',', found '1'  (at char 14), (line:1, col:15)
    

    Note the use of results names, which let you access the parsed fields by name and will make your parser easier to maintain and use.
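For the combined answers with &, | and parentheses, one possible approach is a recursive grammar built with pp.Forward. The sketch below is an illustration under stated assumptions rather than a definitive implementation: prm_expr, answer, and evaluate are hypothetical names, the prm format is inferred from the examples in the question, and "no operator priority" is read here as strict left-to-right evaluation with parenthesized groups evaluated first. The check callback stands in for whatever later code decides whether a single ans or prm was satisfied.

```python
import pyparsing as pp

LPAR, RPAR, COMMA, LT, EQ = map(pp.Suppress, "(),<=")

name = pp.Word(pp.alphas.lower(), pp.alphanums.lower())
number = pp.pyparsing_common.number()

# ans(element,action,subject,num[,num...])[<timeout]
ans_expr = pp.Group(pp.Literal('ans')
                    + LPAR
                    + name('element') + COMMA
                    + name('action') + COMMA
                    + name('subject')
                    + (COMMA + number('num*'))[1, ...]
                    + RPAR
                    + pp.Optional(LT + number('timeout'))
                    )

# prm(elma.pr1=num8[,elma.pr2=num9]) (format assumed from the question)
prop = pp.Group(name('element') + pp.Suppress('.') + name('prop')
                + EQ + number('value'))
prm_expr = pp.Group(pp.Literal('prm') + LPAR
                    + prop + (COMMA + prop)[...] + RPAR)

# recursive definition: an operand is a leaf or a parenthesized answer
answer = pp.Forward()
operand = ans_expr | prm_expr | pp.Group(LPAR + answer + RPAR)
answer <<= operand + (pp.oneOf("& |") + operand)[...]


def evaluate(tokens, check):
    """Fold operands left to right; nested groups are evaluated recursively."""
    def value(tok):
        # leaves start with the literal 'ans' or 'prm'
        if isinstance(tok[0], str) and tok[0] in ('ans', 'prm'):
            return check(tok)
        return evaluate(tok, check)

    result = value(tokens[0])
    for i in range(1, len(tokens), 2):
        op, rhs = tokens[i], value(tokens[i + 1])
        result = (result and rhs) if op == '&' else (result or rhs)
    return result


parsed = answer.parseString(
    "ans(elma,acta,suba,5)<30 | (prm(elma.pr1=8) & prm(elmb.pr2=9))",
    parseAll=True)
print(parsed.asList())

# pretend only the ans answer was satisfied: True | (False & False)
print(evaluate(parsed, lambda leaf: leaf[0] == 'ans'))
```

Because each parenthesized part comes back as its own nested group, the evaluation order follows the () without any precedence table. pyparsing also provides infixNotation for operator expressions, which may be worth a look if you later decide to give & and | different priorities.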