Parsing multiline structured error log with pyparsing


I have a structured error log with 100 entries. Each entry has a very specific structure, which I'm trying to parse for further analysis in Excel.

Splitting the log into its 100 entries works fine. Parsing the structure within each entry does not work yet, and as this is my first time using pyparsing, I'm a bit lost on how to progress from here. See my current code below.

from pyparsing import Dict, Forward, Group, OneOrMore, Suppress, Word, nums, printables

import datetime

from collections import defaultdict

test_string = """
    ErrorLog[1].ErrorDate := D#2021-3-3;
    ErrorLog[1].ErrorTime := TOD#16:1:49.567;
    ErrorLog[1].LocationRef := 20432;
    ErrorLog[1].ErrorCode := 105;
    ErrorLog[1].FirstCheck.Pos[1].Loc := 0;
    ErrorLog[1].FirstCheck.Pos[1].MatchedPos := 0;
    ErrorLog[1].FirstCheck.Pos[2].Loc := 12003;
    ErrorLog[1].FirstCheck.Pos[2].MatchedPos := 5;
    ErrorLog[1].SecondCheck.ID[1] := '4';
    ErrorLog[1].SecondCheck.ID[2] := '9';
    ErrorLog[1].SecondCheck.ID[3] := '0';
    ErrorLog[1].SecondCheck.ID[4] := '7';
    ErrorLog[1].SecondCheck.ID[5] := '0';
    ErrorLog[1].SecondCheck.ID[6] := '1';
    ErrorLog[1].SecondCheck.ID[7] := '8';
    ErrorLog[1].SecondCheck.ID[8] := '4';
    ErrorLog[1].SecondCheck.ID[9] := '2';
    ErrorLog[1].SecondCheck.ID[10] := '4';
    ErrorLog[1].SecondCheck.ID[11] := '0';
    ErrorLog[1].SecondCheck.ID[12] := '6';
    ErrorLog[1].SecondCheck.ID[13] := '7';
    ErrorLog[1].SecondCheck.ID[14] := '7';
    ErrorLog[1].SecondCheck.ID[15] := '1';
    ErrorLog[1].SecondCheck.ID[16] := '0';
    ErrorLog[1].SecondCheck.ID[17] := '8';
    ErrorLog[1].SecondCheck.ID[18] := '3';
    ErrorLog[1].SecondCheck.PositionCount := 5;
    ErrorLog[1].SecondCheck.Pos[1].Loc := 11036;
    ErrorLog[1].SecondCheck.Pos[1].TotalQty := 1;
    ErrorLog[1].SecondCheck.Pos[1].MatchedQty := 1;
    ErrorLog[1].SecondCheck.Pos[2].Loc := 11031;
    ErrorLog[1].SecondCheck.Pos[2].TotalQty := 1;
    ErrorLog[1].SecondCheck.Pos[2].MatchedQty := 1;
"""

LBRK, RBRK, DOT, SEMI, COLON, DASH = map(Suppress, "[].;:-")

integer = Word(nums).setParseAction(lambda t:int(t[0]))
date = (Suppress("D#") + integer + DASH + integer + DASH + integer).setParseAction(lambda t:datetime.datetime(*t))
time = Suppress("TOD#") + integer + COLON + integer + COLON + integer + DOT + integer

key = Word(printables)
value = date | time | Word(printables, exclude_chars=";") 

ID = Suppress("ErrorLog") + LBRK + Word(nums) + RBRK + DOT

struct = Forward()
error_expr = Group(ID("id") + key("key") + Suppress(":=") + value("value") + SEMI)

struct << Dict(OneOrMore(error_expr))

parse_results = struct.parse_string(test_string)
errors = defaultdict(list)

for event in parse_results:
    errors[event[0]].append(event[2])

print(errors)

This outputs the following structure

defaultdict(<class 'list'>, {'1': [datetime.datetime(2021, 3, 3, 0, 0), 16, '20432', '105', '0', '0', '12003', '5', "'4'", "'9'", "'0'", "'7'", "'0'", "'1'", "'8'", "'4'", "'2'", "'4'", "'0'", "'6'", "'7'", "'7'", "'1'", "'0'", "'8'", "'3'", 
'5', '11036', '1', '1', '11031', '1', '1']})

Issues I'm trying to remedy

  1. I would like to have the timestamp included when parsing the date, to make one single date time.

  2. The SecondCheck includes an ID which is basically a string of 18 characters. I would like to parse these into one field.

  3. The output format should ideally be a list of dicts, with each dict containing the key value pairs.

Thought process

It seems like I need something other than the semicolon to distinguish the different fields. The semicolon works fine for everything except the fields that should be aggregated from more than one line. I think I get the basic principle of structuring the parser elements, but after banging my head against this for a few days, I would be very happy to get some tips or hints on how to solve this.


Solution

  • The combination of values across multiple lines is actually more complicated than it looks, so I broke the problem up into parsing the individual lines, and then merging the parsed results into the desired structure.

    Your initial code to define punctuation and the value expressions was a good start. I added a parse action that converts the parsed time into a datetime.time, and an expression for parsing quoted strings:

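    import datetime

    from pyparsing import (Dict, Group, OneOrMore, Optional, Suppress, Word, alphas,
                           delimited_list, nums, printables, quoted_string, remove_quotes)

    LBRK, RBRK, DOT, SEMI, COLON, DASH = map(Suppress, "[].;:-")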
    
    integer = Word(nums).setParseAction(lambda t: int(t[0]))
    date = (Suppress("D#") + integer + DASH + integer + DASH + integer).setParseAction(lambda t: datetime.datetime(*t))
    time = Suppress("TOD#") + integer + COLON + integer + COLON + integer + DOT + integer
    time.setParseAction(lambda t: datetime.time(hour=t[0], minute=t[1], second=t[2], microsecond=t[3]*1000))
    # copy first, so the shared module-level quoted_string expression is not modified
    qs = quoted_string.copy().add_parse_action(remove_quotes)
    

    Using Word(printables) for your key makes the key structures just flat strings, but we need to parse them into parts, along with recognizing when some parts are actually indexed list items.

    Just to break out the steps a bit more, here is a quasi-BNF for a single line of your input (alternatives are separated by '|', optional elements are in [], and repetition is shown with ...):

    error_expr ::= key ':=' value ';'
    key ::= name[index] ['.' name[index]]...
    name ::= alpha...
    index ::= '[' integer ']'
    value ::= integer | date | time | quoted_string | non-semi...
    

    (I added a bunch of setName() calls to label these expressions, and generated this railroad diagram using the new pyparsing create_diagram() method.)

    [Image: railroad diagram of the grammar]

    We are going to make the key expression more explicit: instead of matching any group of printables, it will parse the separate names and their optional [n] indexes. I also added parse actions to turn each name-index part into a tuple, and one to turn the entire key into a tuple:

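    # name and index as defined in the quasi-BNF above
    name = Word(alphas)
    index = LBRK + integer + RBRK
    key = delimited_list((name + Optional(index)).add_parse_action(tuple), delim=".")
    key.add_parse_action(tuple)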
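
    As a quick check of what these parse actions produce, here is a minimal self-contained sketch, assuming pyparsing 3.x. The definitions of name and index follow the quasi-BNF above; the variable name part and the sample key string are just for illustration.

```python
from pyparsing import Optional, Suppress, Word, alphas, delimited_list, nums

LBRK, RBRK = map(Suppress, "[]")
integer = Word(nums).set_parse_action(lambda t: int(t[0]))

# name and index as described in the quasi-BNF
name = Word(alphas)
index = LBRK + integer + RBRK

# each "name" or "name[n]" part becomes a tuple: ('Pos', 2) or ('Loc',)
part = (name + Optional(index)).add_parse_action(tuple)

# the whole dotted key becomes one tuple of those parts
key = delimited_list(part, delim=".").add_parse_action(tuple)

parsed = key.parse_string("ErrorLog[1].FirstCheck.Pos[2].Loc")
print(parsed[0])
# (('ErrorLog', 1), ('FirstCheck',), ('Pos', 2), ('Loc',))
```

    Because a parse action that returns a tuple yields a single token, the whole key ends up as one hashable value, which is what lets it serve as a dictionary key later on.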
    

    What you had for error_expr was fine, except that the separate ID expression is no longer needed, because the new key expression covers the ErrorLog[n] prefix. Since we are just parsing the individual lines, struct does not need to be a Forward; it can simply be one or more error_exprs. I kept your Dict construct, because it will come in handy in the second part of the solution:

    value = integer | date | time | qs | Word(printables, exclude_chars=";")
    error_expr = Group(key("key") + Suppress(":=") + value("value") + SEMI)
    
    struct = Dict(OneOrMore(error_expr))
    

    With this parser, I parsed your test string, and printed out the results using pprint:

    parse_results = struct.parse_string(test_string)
    
    from pprint import pprint
    pprint(parse_results.as_dict())
    

    This gives the following (partial) results:

    {(('ErrorLog', 1), ('ErrorCode',)): 105,
     (('ErrorLog', 1), ('ErrorDate',)): datetime.datetime(2021, 3, 3, 0, 0),
     (('ErrorLog', 1), ('ErrorTime',)): datetime.time(16, 1, 49, 567000),
     (('ErrorLog', 1), ('FirstCheck',), ('Pos', 1), ('Loc',)): 0,
     (('ErrorLog', 1), ('FirstCheck',), ('Pos', 1), ('MatchedPos',)): 0,
     (('ErrorLog', 1), ('FirstCheck',), ('Pos', 2), ('Loc',)): 12003,
     (('ErrorLog', 1), ('FirstCheck',), ('Pos', 2), ('MatchedPos',)): 5,
     (('ErrorLog', 1), ('LocationRef',)): 20432,
     ...
     
    

    We can see that we get a dict where each key is a tuple of name-index or just name for each part in the error expression key, and the values are the parsed values. So the next step is to work through this list of key-values, and build them into a structure.

    Since we are going to work through a sequence of keys and pull out groups by the successive key names, itertools.groupby is a logical choice. groupby can work through a series of items, and return them in groups based on some key function.
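
    To make the grouping idea concrete, here is a small standalone sketch using only the standard library. The key tuples are a hypothetical, shortened sample in the same shape the parser produces (including a made-up second entry); it is not the grouping code used for the actual solution.

```python
from itertools import groupby

# Hypothetical, shortened key tuples in the shape produced by the parser
keys = [
    (("ErrorLog", 1), ("ErrorDate",)),
    (("ErrorLog", 1), ("FirstCheck",), ("Pos", 1), ("Loc",)),
    (("ErrorLog", 1), ("FirstCheck",), ("Pos", 2), ("Loc",)),
    (("ErrorLog", 2), ("ErrorDate",)),
]

# Group by the name of the first key part: every key starts with "ErrorLog"
by_name = [(label, len(list(group)))
           for label, group in groupby(keys, key=lambda k: k[0][0])]

# Group by the whole first part (name and index): entries 1 and 2 separate
by_part = [(part, len(list(group)))
           for part, group in groupby(keys, key=lambda k: k[0])]

print(by_name)  # [('ErrorLog', 4)]
print(by_part)  # [(('ErrorLog', 1), 3), (('ErrorLog', 2), 1)]
```

    Note that groupby only merges consecutive items with the same key, so this approach relies on the lines for one entry (and one sub-structure) being adjacent in the log, as they are in your sample.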

    The grouping code below was pretty hairy, and was the main part of the problem.

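    from itertools import groupby

    def make_nested_groups(parent, idx, seq):
        """Recursively append [name, value] pairs built from the key tuples in seq to parent.

        idx is the position of the key part currently being grouped; the values
        are looked up in the module-level parse_results.
        """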
    
        # every item in seq is a tuple of either (name,) or (name, #)
        # to detect and merge lists, group by name
        for field_label, field_subs in groupby(seq, lambda x: x[idx][0]):
            current = []
            # get subgroups by separate element number
            for field, subfields in groupby(field_subs, key=lambda x: x[idx]):
                subs = list(subfields)
                # if indexes are given, this is a list of subitems
                if len(field) > 1:
                    if not current:
                        parent.append([field_label, current])
    
                    # if we are at the last part of the key, just append
                    # the value; otherwise, append a nested group
                    if len(subs[0]) == idx + 1:
                        current.append(parse_results[subs[0]])
                    else:
                        vals = []
                        make_nested_groups(vals, idx + 1, subs)
                        current.append(vals)
    
                else:
                    # no index, this is just a sub-structure or a single value
                    sub = subs[0]
                    if len(sub) > idx+1:
                        vals = []
                        make_nested_groups(vals, idx + 1, subs)
                        parent.append([field[0], vals])
                    else:
                        parent.append([field[0], parse_results[sub]])
    
    
    errors = []
    make_nested_groups(errors, 0, [pr[0] for pr in parse_results])
    

    Now the items have more structure to them (as stored in errors):

    [['ErrorLog',
      [[['ErrorDate', datetime.datetime(2021, 3, 3, 0, 0)],
        ['ErrorTime', datetime.time(16, 1, 49, 567000)],
        ['LocationRef', 20432],
        ['ErrorCode', 105],
        ['FirstCheck',
         [['Pos',
           [[['Loc', 0], ['MatchedPos', 0]],
            [['Loc', 12003], ['MatchedPos', 5]]]]]],
        ['SecondCheck',
         [['ID',
           ['4',
            '9',
            '0',
            ...
    

    Converting this to a nested dict was much simpler:

    def make_nested_dict(seq):
        try:
            seq_dict = dict(seq)
            return {k: make_nested_dict(v) for k, v in seq_dict.items()}
        except (ValueError, TypeError):
            if isinstance(seq, list):
                return [make_nested_dict(s) for s in seq]
            return seq
    
    error_struct = make_nested_dict(errors)
    

    Which will pprint() as:

    {'ErrorLog': [{'ErrorCode': 105,
                   'ErrorDate': datetime.datetime(2021, 3, 3, 0, 0),
                   'ErrorTime': datetime.time(16, 1, 49, 567000),
                   'FirstCheck': {'Pos': [{'Loc': 0, 'MatchedPos': 0},
                                          {'Loc': 12003, 'MatchedPos': 5}]},
                   'LocationRef': 20432,
                   'SecondCheck': {'ID': ['4',
                                          '9',
                                          '0',
                                          '7',
                                          '0',
                                          '1',
                                          '8',
                                          '4',
                                          '2',
                                          '4',
                                          '0',
                                          '6',
                                          '7',
                                          '7',
                                          '1',
                                          '0',
                                          '8',
                                          '3'],
                                   'Pos': [{'Loc': 11036,
                                            'MatchedQty': 1,
                                            'TotalQty': 1},
                                           {'Loc': 11031,
                                            'MatchedQty': 1,
                                            'TotalQty': 1}],
                                   'PositionCount': 5}}]}
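
    The try/except in make_nested_dict works because dict() only accepts a sequence of two-item pairs. A short standalone sketch of that behaviour, using a hypothetical helper try_dict that exists only for this illustration:

```python
def try_dict(seq):
    """Return dict(seq), or the name of the exception that dict() raised."""
    try:
        return dict(seq)
    except (ValueError, TypeError) as exc:
        return type(exc).__name__

pairs = try_dict([["Loc", 0], ["MatchedPos", 0]])  # a list of pairs converts
chars = try_dict(["4", "9", "0"])                  # 1-character strings are not pairs
scalar = try_dict(105)                             # an int is not iterable at all
two_char = try_dict(["ab", "cd"])                  # caveat: 2-character strings look like pairs

print(pairs)     # {'Loc': 0, 'MatchedPos': 0}
print(chars)     # ValueError
print(scalar)    # TypeError
print(two_char)  # {'a': 'b', 'c': 'd'}
```

    The last case is a limitation to keep in mind: a list of two-character strings would be mistaken for key-value pairs. That does not occur in the sample data, but an explicit type check may be safer if your real log contains such values.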
    

    I will leave it to you to do the last bits of merging the ID field into a single string, and combining ErrorDate and ErrorTime into a datetime.
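
    One possible way to do those last two steps, as a rough sketch using only the standard library. The function name merge_fields and the new key ErrorDateTime are my own choices for illustration, and the sample entry is a shortened, hypothetical version of one item from error_struct['ErrorLog'].

```python
import datetime

# Hypothetical, shortened entry in the shape of error_struct['ErrorLog'][0]
entry = {
    "ErrorDate": datetime.datetime(2021, 3, 3, 0, 0),
    "ErrorTime": datetime.time(16, 1, 49, 567000),
    "ErrorCode": 105,
    "SecondCheck": {"ID": ["4", "9", "0", "7"], "PositionCount": 5},
}

def merge_fields(entry):
    """Return a copy of entry with date and time combined and the ID joined."""
    merged = dict(entry)

    # combine() ignores the time components of a datetime passed as the date
    date = merged.pop("ErrorDate")
    time = merged.pop("ErrorTime")
    merged["ErrorDateTime"] = datetime.datetime.combine(date, time)

    # copy the nested dict so the original entry is left untouched
    second = dict(merged["SecondCheck"])
    second["ID"] = "".join(second["ID"])
    merged["SecondCheck"] = second
    return merged

merged = merge_fields(entry)
print(merged["ErrorDateTime"])      # 2021-03-03 16:01:49.567000
print(merged["SecondCheck"]["ID"])  # 4907
```

    Applied to the full structure, something like [merge_fields(e) for e in error_struct['ErrorLog']] would give the list of dicts asked for in the question, assuming every entry has the ErrorDate, ErrorTime and SecondCheck fields.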