Pyparsing Forward() Grammar Recursion


I'm using Pyparsing to parse a log file which has blocks that look like this:

keyName0:                                     foo
keyName1:                                     bar
msgKey [Read]:                                21 FA 00 34
msgKey [Read]:
  MESSAGE 1 of 2
    keyName0:                                 keyValue0
    keyName1:                                 keyValue1
    Flags1:                                   No Flags Set
    Flags1:                                   0
    Flags2:                                   No Flags Set
    Flags2:                                   0
    keyName6:                                 $12AB34CD56EF (123456789)
    keyName7:                                 7
    keyName8:                                 7
    Data [Read]:                              00 01 02 03    04 05 06 07    08 09 10 11    12 13 14 15
                                              20 21 22 23    24 25 26 27    28 29 30 31    32 33 34 35
                                              36 37 38

msgKey [Read]:                                01 02 03 04
msgKey [Read]:
  MESSAGE 2 of 2
    # same structure as message above

keyName3:                                     keyValue3
keyName4 [IN]:                                keyValue4 (123 IN)
keyName4 [OUT]:                               keyValue4 (123 OUT)

I wrote a grammar for the key-value lines:

key_line = lineEnd + OneOrMore(Word(printables_no_column)).setParseAction(' '.join).setResultsName('keyName') + Suppress(':') \
       + OneOrMore(Word(printables_no_column), stopOn=lineEnd).setParseAction(' '.join).setResultsName('keyValue')

This grammar works fine for individual lines. I then tried to reuse it to describe the whole test data:

message = Forward()
key_line = lineEnd + OneOrMore(Word(printables_no_column)).setParseAction(' '.join).setResultsName('keyName') + Suppress(':') \
       + MatchFirst(message, OneOrMore(Word(printables_no_column),stopOn=lineEnd).setParseAction(' '.join).setResultsName('keyValue'))
key_lines = ZeroOrMore(Group(key_line)).setResultsName('keys')
message << Literal('MESSAGE') + number + Literal('of') \
           + number.setResultsName('totalMsgs') + key_lines

However, I think that this grammar ends in infinite recursion. I need help to figure out how to use the Forward() recursive grammar properly. Many thanks in advance!


Solution

  • This should move you forward a bit. The overall structure may still need work, but I think the basic pieces are here. See the embedded comments:

    import pyparsing as pp
    
    # your original expression - x.setResultsName("x") can now be written just x("x")
    # key_line = (lineEnd
    #             + OneOrMore(Word(printables_no_column)).setParseAction(' '.join)('keyName')
    #             + Suppress(':')
    #             + OneOrMore(Word(printables_no_column), stopOn=lineEnd).setParseAction(' '.join)('keyValue'))
    
    # literals in your grammar will be suppressed by default
    pp.ParserElement.inlineLiteralsUsing(pp.Suppress)
    
    integer = pp.pyparsing_common.integer
    hex_byte = pp.Word(pp.hexnums, exact=2)
    
    # read everything up to ':' -  a little risky to define a Word including spaces, may want to revisit and
    # explicitly parse bits, to detect "[IN]" vs "[OUT]", etc.
    key_name_expr = pp.Word(pp.printables + " ", excludeChars=':')
    key_line = pp.Group(key_name_expr("key_name") + ':'
                        + ~pp.lineEnd()  # make sure key value is on this same line
                        + pp.empty()     # handy trick to advance past white space
                        + pp.restOfLine()('key_value'))
    
    # special key_line to read data bytes
    data_body = "Data [Read]:" + pp.OneOrMore(hex_byte)
    
    msg_body = ("msgKey [Read]:" + pp.lineEnd()
                + "MESSAGE" + integer("message_num") + "of" + integer("total_msgs")
                + pp.OneOrMore(pp.Group(key_line)("params*"), stopOn=data_body)
                + data_body("data"))
    
    msg_expr = (pp.OneOrMore(pp.LineStart() + pp.Group(key_line)("params*"), stopOn=msg_body)
                + pp.Optional(pp.Group(msg_body)("body")))
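To see what the `~pp.lineEnd()` and `pp.empty()` pieces of `key_line` do, here is a minimal standalone sketch. It rebuilds `key_line` with an explicit `pp.Suppress(":")` so it does not depend on the global `inlineLiteralsUsing` setting; the two sample lines are taken from the log in the question.

```python
import pyparsing as pp

# Same shape as key_line above, with ':' suppressed explicitly.
key_name_expr = pp.Word(pp.printables + " ", excludeChars=":")
key_line = pp.Group(key_name_expr("key_name") + pp.Suppress(":")
                    + ~pp.lineEnd()   # fail if nothing follows ':' on this line
                    + pp.empty()      # skip the padding before the value
                    + pp.restOfLine()("key_value"))

# A line with a value on the same line parses into name and value.
same_line = key_line.parseString("keyName4 [IN]:          keyValue4 (123 IN)\n")
name = same_line[0].key_name
value = same_line[0].key_value

# A "msgKey [Read]:" header with nothing after ':' is rejected, which is
# what lets msg_body claim that line instead.
try:
    key_line.parseString("msgKey [Read]:\n  MESSAGE 1 of 2\n")
    header_rejected = False
except pp.ParseException:
    header_rejected = True

print(repr(name), repr(value), header_rejected)
```

Without `pp.empty()`, the value would keep its leading padding, because `restOfLine` does not skip whitespace on its own.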
    

    Use searchString to find matching blocks in the log text (held here in the string source), and dump them out:

    for match in msg_expr.searchString(source):
        print(match.dump())
        # some sample code showing how to access parsed data fields
        if match.body:
            print("Msg {message_num}/{total_msgs}".format_map(match.body))
            print(match.body.data)
        print()
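The trailing `*` in `("params*")` is what makes `match.body.params` a list of every matched line rather than just the last one. A small sketch of the difference, using a made-up `key=value` format (the names `pair`, `pairs`, `key` and `value` are illustrative only, not part of the log grammar):

```python
import pyparsing as pp

# Hypothetical "key=value" pair, grouped so each match keeps its own names.
pair = pp.Group(pp.Word(pp.alphas)("key")
                + pp.Suppress("=")
                + pp.pyparsing_common.integer("value"))

text = "a=1 b=2 c=3"

# Plain results name: each repetition overwrites the previous one.
last_only = pp.OneOrMore(pair("pair")).parseString(text)
last_value = last_only.pair.value

# Name ending in "*": shorthand for setResultsName(..., listAllMatches=True),
# so every repetition is kept.
all_kept = pp.OneOrMore(pair("pairs*")).parseString(text)
all_values = [p.value for p in all_kept.pairs]

print(last_value, all_values)
```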
    

    Prints (excerpt shown):

    [[['keyName1', '2']], [['msgKey [Read]', '21 FA 00 34']], ['\n', 1, 2, [['keyName0', 'keyValue0']], [['keyName1', 'keyValue1']], [['Flags1', 'No Flags Set']], [['Flags1', '0']], [['Flags2', 'No Flags Set']], [['Flags2', '0']], [['keyName6', '$12AB34CD56EF (123456789)']], [['keyName7', '7']], [['keyName8', '7']], '00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38']]
    - body: ['\n', 1, 2, [['keyName0', 'keyValue0']], [['keyName1', 'keyValue1']], [['Flags1', 'No Flags Set']], [['Flags1', '0']], [['Flags2', 'No Flags Set']], [['Flags2', '0']], [['keyName6', '$12AB34CD56EF (123456789)']], [['keyName7', '7']], [['keyName8', '7']], '00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38']
      - data: ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38']
      - message_num: 1
      - params: [[['keyName0', 'keyValue0']], [['keyName1', 'keyValue1']], [['Flags1', 'No Flags Set']], [['Flags1', '0']], [['Flags2', 'No Flags Set']], [['Flags2', '0']], [['keyName6', '$12AB34CD56EF (123456789)']], [['keyName7', '7']], [['keyName8', '7']]]
        [0]:
          [['keyName0', 'keyValue0']]
          [0]:
            ['keyName0', 'keyValue0']
            - key_name: 'keyName0'
            - key_value: 'keyValue0'
        [1]:
          [['keyName1', 'keyValue1']]
          [0]:
            ['keyName1', 'keyValue1']
            - key_name: 'keyName1'
            - key_value: 'keyValue1'
        [2]:
          [['Flags1', 'No Flags Set']]
          [0]:
            ['Flags1', 'No Flags Set']
            - key_name: 'Flags1'
            - key_value: 'No Flags Set'
        [3]:
          [['Flags1', '0']]
          [0]:
            ['Flags1', '0']
            - key_name: 'Flags1'
            - key_value: '0'
        [4]:
          [['Flags2', 'No Flags Set']]
          [0]:
            ['Flags2', 'No Flags Set']
            - key_name: 'Flags2'
            - key_value: 'No Flags Set'
        [5]:
          [['Flags2', '0']]
          [0]:
            ['Flags2', '0']
            - key_name: 'Flags2'
            - key_value: '0'
        [6]:
          [['keyName6', '$12AB34CD56EF (123456789)']]
          [0]:
            ['keyName6', '$12AB34CD56EF (123456789)']
            - key_name: 'keyName6'
            - key_value: '$12AB34CD56EF (123456789)'
        [7]:
          [['keyName7', '7']]
          [0]:
            ['keyName7', '7']
            - key_name: 'keyName7'
            - key_value: '7'
        [8]:
          [['keyName8', '7']]
          [0]:
            ['keyName8', '7']
            - key_name: 'keyName8'
            - key_value: '7'
      - total_msgs: 2
    - params: [[['keyName1', '2']], [['msgKey [Read]', '21 FA 00 34']]]
      [0]:
        [['keyName1', '2']]
        [0]:
          ['keyName1', '2']
          - key_name: 'keyName1'
          - key_value: '2'
      [1]:
        [['msgKey [Read]', '21 FA 00 34']]
        [0]:
          ['msgKey [Read]', '21 FA 00 34']
          - key_name: 'msgKey [Read]'
          - key_value: '21 FA 00 34'
    Msg 1/2
    ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38']
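On the original question of how to use `Forward()`: the grammar above avoids recursion entirely, but for reference, here is a minimal sketch of the usual pattern on a made-up nested-list format (not the log format). Declare the `Forward` placeholder, use it inside other expressions, then attach the real definition with `<<=`. Recursion terminates as long as each recursive path consumes some input (here the opening parenthesis) before re-entering the placeholder.

```python
import pyparsing as pp

integer = pp.pyparsing_common.integer
LPAR, RPAR = map(pp.Suppress, "()")

# Placeholder that can be referenced before it is defined.
item = pp.Forward()

# A nested list must consume "(" before it recurses into item,
# so every level of recursion advances through the input.
nested_list = pp.Group(LPAR + pp.ZeroOrMore(item) + RPAR)

# Attach the definition. Prefer <<= over <<: Python parses
# "item << a | b" as "(item << a) | b", which silently drops b.
item <<= integer | nested_list

result = item.parseString("(1 (2 3) ((4)) 5)", parseAll=True)
nested = result.asList()
print(nested)  # [[1, [2, 3], [[4]], 5]]
```

Two details in the question's grammar are worth checking against this pattern: alternatives are written with `|` (or as a single list passed to `MatchFirst([...])`, since its second positional argument is not another expression), and the `Forward` definition is assigned once, after everything it refers to exists.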