Search code examples
pythonregexpandaspython-re

How to parse this log using python regex and export to excel with pandas (optional)?


I have a log file in the below format. For each line I need to capture 3rd column e.g 0102b69880c4b330, corresponding message DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG and their respective counts (please see the output). I thought using regular expressions makes solution easier for me.

Explanation:

Case 1: The ID 0102b69880c4b330 occured 3 times (line 1, 2, 3). So the count for ID is 3 and the corresponding message DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG also occurred 3 times, so the count 3.

Case 2: Now the ID 0102b69880c4e3b2 in 4th and 5th line has two different messages JMS DO_METHOD TRACE LAUNCH, DO_METHOD TRACE LAUNCH, the ID count is 2 but the count for their message should be 1, 1 respectively.

Case 3: The ID 0102b6988000000c in 10th line to last line has message DM_WORKFLOW_E_PROCESS_AUTO_TASK. The ID count is 3, message count is 3. But here I need to take the process task id and workflow id which is next to this error message.

I used [Ignore for this] in output to just explain I don't need id's.

And finally I also need to maintain the total count of DM_WORKFLOW_E_PROCESS_AUTO_TASK.

Input:
2019-05-05T00:05:11.507245  12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent
2019-05-05T00:05:11.759829  12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.
2019-05-05T00:05:11.759898  12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.
2019-05-05T01:40:53.148751  20135[20135]    0102b69880c4e3b2    JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod 
2019-05-05T01:40:53.148877  20135[20135]    0102b69880c4e3b2    DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod
2019-05-07T05:42:21.171087  22484[22484]    0102b6988000000b    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  "Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors."
2019-05-05T05:24:48.483966  17114[17114]    0102b69880c4fb1e    JMS DO_METHOD TRACE LAUNCH: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod, arguments:-method_verb com.emc.d2.api.methods.D2Method -class_name com.emc.d2.api.methods.D2LifecycleChangeStateMethod -__dm_docbase__ SubWayX -__dm_server_config__ host-address_SubWayX -docbase_name SubWayX -user_name dmadmin -method_return_id "0802b6988167b46e" -locale en
2019-05-05T05:24:50.362650  17114[17114]    0102b69880c4fb1e    JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod 
2019-05-05T05:24:50.362702  17114[17114]    0102b69880c4fb1e    DO_METHOD TRACE LAUNCH: method launch: successful, user: dmadmin, session id: 0102b69880c4fb1e, method: D2LifecycleChangeStateMethod
2019-05-05T05:44:35.410674  12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  "Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:50:31.383668  12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  "Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:53:49.978053  12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  "Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T00:50:11.761273  2591[2591]  0102b69880c4ccde    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance phchbs-sp220333_9200_IndexAgent
2019-05-05T00:50:12.015521  2591[2591]  0102b69880c4ccde    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 1 seconds.
2019-05-05T00:50:12.015563  2591[2591]  0102b69880c4ccde    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance phchbs-sp220333_9200_IndexAgent -details false to Index Agent phchbs-sp220333_9200_IndexAgent is successful.


I need to get the below output:

Output:
ID:                 Count:          Message:                                            Corresponding Message Count         Task ID:                                                Workflow ID
0102b69880c4b330     3              DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG                  3                                   [Ignore for this]                                       [Ignore for this]
0102b69880c4e3b2     2              JMS DO_METHOD TRACE LAUNCH, DO_METHOD TRACE LAUNCH  1, 1                                [Ignore for this]                                       [Ignore for this]
0102b6988000000b     1              DM_WORKFLOW_E_PROCESS_AUTO_TASK                     1                                   4a02b698800aad04                                        4d02b6988000f709
0102b69880c4fb1e     3              JMS DO_METHOD TRACE LAUNCH, DO_METHOD TRACE LAUNCH  2, 1                                [Ignore for this]                                       [Ignore for this]
0102b6988000000c     3              DM_WORKFLOW_E_PROCESS_AUTO_TASK                     3                                   4a02b698800a977c, 4a02b698800a9782, 4a02b698800a9784    4d02b698800107e9, 4d02b6988001081e, 4d02b6988001081c 

The program I have tried for testing is below. I have not properly used regex after ID column, I just picked the value that contains value within [], but it skips that doesn't. And it doesn't pick process task id and workflow id as well. Can you help me how to modify my code to get the proper count, task id and workflow id?

import re
import collections

regexp = re.compile(
        r'(?P<date>[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{6}\s*)'+
        '(?P<un_num>[0-9]{3,5}\[[0-9]{3,5}\]\s*)'+
        '(?P<id>[a-z0-9]{16}\s*)'+
        '(?P<message>\[(.*?)\])'
        )
ls = ["2019-05-05T00:05:11.507245   12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent",
      "2019-05-05T00:05:11.759829   12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.",
      "2019-05-05T00:05:11.759898   12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.",
      "2019-05-05T01:40:53.148751   20135[20135]    0102b69880c4e3b2    JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod",
      "2019-05-05T01:40:53.148877   20135[20135]    0102b69880c4e3b2    DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod",
      "2019-05-07T05:42:21.171087   22484[22484]    0102b6988000000b    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  'Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors.'",
      "2019-05-05T05:44:35.410674   12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  'Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'",
      "2019-05-05T05:50:31.383668   12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  'Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'",
      "2019-05-05T05:53:49.978053   12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  'Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'"
      ]

id_counter = collections.Counter()
message_counter = collections.Counter()

print("started......!!!!!")
for i in range(len(ls)):
    x = regexp.match(ls[i])
    y = re.search(regexp, ls[i])
    if x is None or y is None:
        print("None")
        continue
    print("-----------------")
    print(y.group('date'))
    print(y.group('un_num'))
    print(y.group('id'))
    id_counter.update([y.group('id')])
    print(y.group('message'))
    message_counter.update([y.group('message')])

print("end....!!!")

print(id_counter)
print(message_counter)

def print_counts(cdict):
    for key, values in enumerate(cdict.items()):
        print(key, values)

print_counts(id_counter)
print_counts(message_counter)

The output for this is:

started......!!!!!
-----------------
2019-05-05T00:05:11.507245      
12090[12090]    
0102b69880c4b330        
[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]
-----------------
2019-05-05T00:05:11.759829      
12090[12090]    
0102b69880c4b330        
[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]
-----------------
2019-05-05T00:05:11.759898      
12090[12090]    
0102b69880c4b330        
[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]
None
None
-----------------
2019-05-07T05:42:21.171087      
22484[22484]    
0102b6988000000b    
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
-----------------
2019-05-05T05:44:35.410674      
12791[12791]    
0102b6988000000c        
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
-----------------
2019-05-05T05:50:31.383668      
12791[12791]    
0102b6988000000c        
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
-----------------
2019-05-05T05:53:49.978053      
12791[12791]    
0102b6988000000c        
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
end....!!!
Counter({'0102b69880c4b330\t': 3, '0102b6988000000c\t': 3, '0102b6988000000b    ': 1})
Counter({'[DM_WORKFLOW_E_PROCESS_AUTO_TASK]': 4, '[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]': 3})
0 ('0102b69880c4b330\t', 3)
1 ('0102b6988000000b    ', 1)
2 ('0102b6988000000c\t', 3)
0 ('[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]', 3)
1 ('[DM_WORKFLOW_E_PROCESS_AUTO_TASK]', 4)

Solution

  • Starting with the input data as text:

    txt = """
    2019-05-05T00:05:11.507245  12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent
    2019-05-05T00:05:11.759829  12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.
    2019-05-05T00:05:11.759898  12090[12090]    0102b69880c4b330    [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.
    2019-05-05T01:40:53.148751  20135[20135]    0102b69880c4e3b2    JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod 
    2019-05-05T01:40:53.148877  20135[20135]    0102b69880c4e3b2    DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod
    2019-05-07T05:42:21.171087  22484[22484]    0102b6988000000b    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  "Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors."
    2019-05-05T05:24:48.483966  17114[17114]    0102b69880c4fb1e    JMS DO_METHOD TRACE LAUNCH: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod, arguments:-method_verb com.emc.d2.api.methods.D2Method -class_name com.emc.d2.api.methods.D2LifecycleChangeStateMethod -__dm_docbase__ SubWayX -__dm_server_config__ host-address_SubWayX -docbase_name SubWayX -user_name dmadmin -method_return_id "0802b6988167b46e" -locale en
    2019-05-05T05:24:50.362650  17114[17114]    0102b69880c4fb1e    JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod 
    2019-05-05T05:24:50.362702  17114[17114]    0102b69880c4fb1e    DO_METHOD TRACE LAUNCH: method launch: successful, user: dmadmin, session id: 0102b69880c4fb1e, method: D2LifecycleChangeStateMethod
    2019-05-05T05:44:35.410674  12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  "Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
    2019-05-05T05:50:31.383668  12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  "Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
    2019-05-05T05:53:49.978053  12791[12791]    0102b6988000000c    [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error:  "Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
    """
    

    We can do some preprocessing, first split into lines and discard empty lines:

    lines = [line for line in txt.split('\n') if line.strip()]
    

    Then extract the chunks we're interested in, but just a rough (and very fast) splitting of the data

    parts = [(line[44:60], line[64:].split(':', 1))  for line in lines]
    

    update: since your new data is not fixed width, we'll need some other way of pre-processing it, e.g.:

    # parts = [(line[44:60], line[64:].split(':', 1))  for line in lines]
    import re
    lines = [re.sub(r'\s+', ' ', line) for line in lines]   # squash all multiple spaces to a single space
    parts = [line.split() for line in lines]  # split on whitespace
    parts = [(line[2], ' '.join(line[3:]).split(':', 1)) for line in parts]  # this is similar to the original line
    

    remember, this part is only to make the final processing in the InputData class below easier.

    Then we create a data structure for the input data we're interested in, that can take the pre-processed data we have in parts:

    class InputData(object):
        def __init__(self, idtag, (msg, details)):  # py3 is more awkward here (*)
            self.idtag = idtag
            self.error_task = None
            self.error_workflow = None
            msg = msg.strip()
            if msg.endswith('] info'):
                self.msg = msg[1:-len('] info')]
            elif msg.endswith('error'):
                self.msg = msg[1:-len(']error')]
                self.error_task = details.split(' task ', 1)[1].split(' ', 1)[0]
                self.error_workflow = details.split(' workflow ', 1)[1].split('.', 1)[0]
            else:
                self.msg = msg
    
        def __repr__(self):
            return repr(self.__dict__)  # this is a great trick for making debugging easier
    

    (*) for py3 you'll need (not sure why they changed this...?)

    def __init__(self, idtag, tmp):
        msg, details = tmp
    

    Now we can apply this class to the pre-processed input:

    input_data = [InputData(*part) for part in parts]
    

    and if we print out what we have so far:

    for d in input_data:
        print d
    

    the result is:

    {'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4b330', 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
    {'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4b330', 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
    {'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4b330', 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
    {'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4e3b2', 'msg': 'JMS DO_METHOD TRACE LAUNCH'}
    {'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4e3b2', 'msg': 'DO_METHOD TRACE LAUNCH'}
    {'error_workflow': '4d02b6988000f709', 'error_task': '4a02b698800aad04', 'idtag': '0102b6988000000b', 'msg': 'DM_WORKFLOW_E_PROCESS_AUTO_TASK'}
    ...
    

    Now we create a class representing the data we want in the output:

    from collections import defaultdict
    
    class OutputData(object):
        def __init__(self):   # I'm using this class in a defaultdict, so the __init__ method can't take any arguments
            self.idtag = None
            self.idtag_count = 0
            self.messages = defaultdict(int)
            self.errors = []
            self.workflows = []
    
        def add(self, indata):
            "Adds indata to this object."
            self.idtag = indata.idtag
            self.idtag_count += 1
            self.messages[indata.msg] += 1        
            if indata.error_task:
                self.errors.append(indata.error_task)
                self.workflows.append(indata.error_workflow)
    

    and feed out input data into it:

    output_data = defaultdict(OutputData)
    
    for indata in input_data:
        output_data[indata.idtag].add(indata)
    

    finally, we can output the output data in the desired format:

    fmt = '%-20s %-6s %-55s %-15s %-60s %s'
    
    print fmt % ('ID:', 'Count:', 'Message:', 'msg counts', 'taskid', 'workflowid')
    for outdata in output_data.values():
        print fmt % (
            outdata.idtag,
            outdata.idtag_count,
            ', '.join(outdata.messages.keys()),
            ', '.join(str(outdata.messages[k]) for k in outdata.messages.keys()),
            ', '.join(outdata.errors),
            ', '.join(outdata.workflows)
        )
    

    This type of structure, ie.: pre-process text, extract interesting input data, convert input data to output data, and finally serializing/formatting the output data; works well for all such problems, and it makes it easier to both debug and modify in the future.