I have a log file in the below format. For each line I need to capture 3rd column e.g 0102b69880c4b330
, corresponding message DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG
and their respective counts (please see the output). I thought using regular expressions makes solution easier for me.
Explanation:
Case 1: The ID 0102b69880c4b330
occured 3 times (line 1, 2, 3). So the count for ID is 3 and the corresponding message DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG
also occurred 3 times, so the count 3.
Case 2: Now the ID 0102b69880c4e3b2
in 4th and 5th line has two different messages JMS DO_METHOD TRACE LAUNCH, DO_METHOD TRACE LAUNCH
, the ID count is 2 but the count for their message should be 1, 1 respectively.
Case 3: The ID 0102b6988000000c
in 10th line to last line has message DM_WORKFLOW_E_PROCESS_AUTO_TASK
. The ID count is 3, message count is 3. But here I need to take the process task id and workflow id which is next to this error message.
I used [Ignore for this]
in output to just explain I don't need id's.
And finally I also need to maintain the total count of DM_WORKFLOW_E_PROCESS_AUTO_TASK
.
Input:
2019-05-05T00:05:11.507245 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent
2019-05-05T00:05:11.759829 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.
2019-05-05T00:05:11.759898 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.
2019-05-05T01:40:53.148751 20135[20135] 0102b69880c4e3b2 JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod
2019-05-05T01:40:53.148877 20135[20135] 0102b69880c4e3b2 DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod
2019-05-07T05:42:21.171087 22484[22484] 0102b6988000000b [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors."
2019-05-05T05:24:48.483966 17114[17114] 0102b69880c4fb1e JMS DO_METHOD TRACE LAUNCH: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod, arguments:-method_verb com.emc.d2.api.methods.D2Method -class_name com.emc.d2.api.methods.D2LifecycleChangeStateMethod -__dm_docbase__ SubWayX -__dm_server_config__ host-address_SubWayX -docbase_name SubWayX -user_name dmadmin -method_return_id "0802b6988167b46e" -locale en
2019-05-05T05:24:50.362650 17114[17114] 0102b69880c4fb1e JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod
2019-05-05T05:24:50.362702 17114[17114] 0102b69880c4fb1e DO_METHOD TRACE LAUNCH: method launch: successful, user: dmadmin, session id: 0102b69880c4fb1e, method: D2LifecycleChangeStateMethod
2019-05-05T05:44:35.410674 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:50:31.383668 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:53:49.978053 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T00:50:11.761273 2591[2591] 0102b69880c4ccde [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance phchbs-sp220333_9200_IndexAgent
2019-05-05T00:50:12.015521 2591[2591] 0102b69880c4ccde [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 1 seconds.
2019-05-05T00:50:12.015563 2591[2591] 0102b69880c4ccde [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance phchbs-sp220333_9200_IndexAgent -details false to Index Agent phchbs-sp220333_9200_IndexAgent is successful.
I need to get the below output:
Output:
ID: Count: Message: Corresponding Message Count Task ID: Workflow ID
0102b69880c4b330 3 DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG 3 [Ignore for this] [Ignore for this]
0102b69880c4e3b2 2 JMS DO_METHOD TRACE LAUNCH, DO_METHOD TRACE LAUNCH 1, 1 [Ignore for this] [Ignore for this]
0102b6988000000b 1 DM_WORKFLOW_E_PROCESS_AUTO_TASK 1 4a02b698800aad04 4d02b6988000f709
0102b69880c4fb1e 3 JMS DO_METHOD TRACE LAUNCH, DO_METHOD TRACE LAUNCH 2, 1 [Ignore for this] [Ignore for this]
0102b6988000000c 3 DM_WORKFLOW_E_PROCESS_AUTO_TASK 3 4a02b698800a977c, 4a02b698800a9782, 4a02b698800a9784 4d02b698800107e9, 4d02b6988001081e, 4d02b6988001081c
The program I have tried for testing is below. I have not properly used regex after ID column, I just picked the value that contains value within [], but it skips that doesn't. And it doesn't pick process task id and workflow id as well. Can you help me how to modify my code to get the proper count, task id and workflow id?
import re
import collections
regexp = re.compile(
r'(?P<date>[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{6}\s*)'+
'(?P<un_num>[0-9]{3,5}\[[0-9]{3,5}\]\s*)'+
'(?P<id>[a-z0-9]{16}\s*)'+
'(?P<message>\[(.*?)\])'
)
ls = ["2019-05-05T00:05:11.507245 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent",
"2019-05-05T00:05:11.759829 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.",
"2019-05-05T00:05:11.759898 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.",
"2019-05-05T01:40:53.148751 20135[20135] 0102b69880c4e3b2 JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod",
"2019-05-05T01:40:53.148877 20135[20135] 0102b69880c4e3b2 DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod",
"2019-05-07T05:42:21.171087 22484[22484] 0102b6988000000b [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: 'Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors.'",
"2019-05-05T05:44:35.410674 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: 'Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'",
"2019-05-05T05:50:31.383668 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: 'Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'",
"2019-05-05T05:53:49.978053 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: 'Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs.'"
]
id_counter = collections.Counter()
message_counter = collections.Counter()
print("started......!!!!!")
for i in range(len(ls)):
x = regexp.match(ls[i])
y = re.search(regexp, ls[i])
if x is None or y is None:
print("None")
continue
print("-----------------")
print(y.group('date'))
print(y.group('un_num'))
print(y.group('id'))
id_counter.update([y.group('id')])
print(y.group('message'))
message_counter.update([y.group('message')])
print("end....!!!")
print(id_counter)
print(message_counter)
def print_counts(cdict):
for key, values in enumerate(cdict.items()):
print(key, values)
print_counts(id_counter)
print_counts(message_counter)
The output for this is:
started......!!!!!
-----------------
2019-05-05T00:05:11.507245
12090[12090]
0102b69880c4b330
[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]
-----------------
2019-05-05T00:05:11.759829
12090[12090]
0102b69880c4b330
[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]
-----------------
2019-05-05T00:05:11.759898
12090[12090]
0102b69880c4b330
[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]
None
None
-----------------
2019-05-07T05:42:21.171087
22484[22484]
0102b6988000000b
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
-----------------
2019-05-05T05:44:35.410674
12791[12791]
0102b6988000000c
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
-----------------
2019-05-05T05:50:31.383668
12791[12791]
0102b6988000000c
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
-----------------
2019-05-05T05:53:49.978053
12791[12791]
0102b6988000000c
[DM_WORKFLOW_E_PROCESS_AUTO_TASK]
end....!!!
Counter({'0102b69880c4b330\t': 3, '0102b6988000000c\t': 3, '0102b6988000000b ': 1})
Counter({'[DM_WORKFLOW_E_PROCESS_AUTO_TASK]': 4, '[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]': 3})
0 ('0102b69880c4b330\t', 3)
1 ('0102b6988000000b ', 1)
2 ('0102b6988000000c\t', 3)
0 ('[DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG]', 3)
1 ('[DM_WORKFLOW_E_PROCESS_AUTO_TASK]', 4)
Starting with the input data as text:
txt = """
2019-05-05T00:05:11.507245 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info: Attempting to status Index Agent Instance host-address_9200_IndexAgent
2019-05-05T00:05:11.759829 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : Response from HTTP_POST command: HTTP/1.1 200 OK Status: 0 , Time Taken: 0 seconds.
2019-05-05T00:05:11.759898 12090[12090] 0102b69880c4b330 [DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG] info : HTTP_POST with args -command status -docbase SubWayX -user dm_fulltext_index_user -ticket ****** -instance host-address_9200_IndexAgent -details false to Index Agent host-address_9200_IndexAgent is successful.
2019-05-05T01:40:53.148751 20135[20135] 0102b69880c4e3b2 JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: Xie Xiaoke, session id: 0102b69880c4e3b2, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod
2019-05-05T01:40:53.148877 20135[20135] 0102b69880c4e3b2 DO_METHOD TRACE LAUNCH: method launch: successful, user: Xie Xiaoke, session id: 0102b69880c4e3b2, method: D2LifecycleChangeStateMethod
2019-05-07T05:42:21.171087 22484[22484] 0102b6988000000b [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800aad04 of workflow 4d02b6988000f709. The task is using method 'D2WFLifeCycleMethod'. Activity: 'Demote to Draft with new Version'. Check the Java Method Server log for errors."
2019-05-05T05:24:48.483966 17114[17114] 0102b69880c4fb1e JMS DO_METHOD TRACE LAUNCH: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod, arguments:-method_verb com.emc.d2.api.methods.D2Method -class_name com.emc.d2.api.methods.D2LifecycleChangeStateMethod -__dm_docbase__ SubWayX -__dm_server_config__ host-address_SubWayX -docbase_name SubWayX -user_name dmadmin -method_return_id "0802b6988167b46e" -locale en
2019-05-05T05:24:50.362650 17114[17114] 0102b69880c4fb1e JMS DO_METHOD TRACE LAUNCH: do_method launch: successful: user: dmadmin, session id: 0102b69880c4fb1e, JMS id: 0802b69880003535, method: D2LifecycleChangeStateMethod, host:host-address.net, port:9082, path:/DmMethods/servlet/DoMethod
2019-05-05T05:24:50.362702 17114[17114] 0102b69880c4fb1e DO_METHOD TRACE LAUNCH: method launch: successful, user: dmadmin, session id: 0102b69880c4fb1e, method: D2LifecycleChangeStateMethod
2019-05-05T05:44:35.410674 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a977c of workflow 4d02b698800107e9. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:50:31.383668 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a9782 of workflow 4d02b6988001081e. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
2019-05-05T05:53:49.978053 12791[12791] 0102b6988000000c [DM_WORKFLOW_E_PROCESS_AUTO_TASK]error: "Workflow Agent failed to process task 4a02b698800a9784 of workflow 4d02b6988001081c. The task is using method 'D2WFLifeCycleMethod'. Activity: 'validate entry conditions for Effective'. Method timed out within 60 secs."
"""
We can do some preprocessing, first split into lines and discard empty lines:
lines = [line for line in txt.split('\n') if line.strip()]
Then extract the chunks we're interested in, but just a rough (and very fast) splitting of the data
parts = [(line[44:60], line[64:].split(':', 1)) for line in lines]
update: since your new data is not fixed width, we'll need some other way of pre-processing it, e.g.:
# parts = [(line[44:60], line[64:].split(':', 1)) for line in lines]
import re
lines = [re.sub(r'\s+', ' ', line) for line in lines] # squash all multiple spaces to a single space
parts = [line.split() for line in lines] # split on whitespace
parts = [(line[2], ' '.join(line[3:]).split(':', 1)) for line in parts] # this is similar to the original line
remember, this part is only to make the final processing in the InputData class below easier.
Then we create a data structure for the input data we're interested in, that can take the pre-processed data we have in parts:
class InputData(object):
def __init__(self, idtag, (msg, details)): # py3 is more awkward here (*)
self.idtag = idtag
self.error_task = None
self.error_workflow = None
msg = msg.strip()
if msg.endswith('] info'):
self.msg = msg[1:-len('] info')]
elif msg.endswith('error'):
self.msg = msg[1:-len(']error')]
self.error_task = details.split(' task ', 1)[1].split(' ', 1)[0]
self.error_workflow = details.split(' workflow ', 1)[1].split('.', 1)[0]
else:
self.msg = msg
def __repr__(self):
return repr(self.__dict__) # this is a great trick for making debugging easier
(*) for py3 you'll need (not sure why they changed this...?)
def __init__(self, idtag, tmp):
msg, details = tmp
Now we can apply this class to the pre-processed input:
input_data = [InputData(*part) for part in parts]
and if we print out what we have so far:
for d in input_data:
print d
the result is:
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4b330', 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4b330', 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4b330', 'msg': 'DM_FT_INDEX_T_INIT_INDEX_AGENT_MSG'}
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4e3b2', 'msg': 'JMS DO_METHOD TRACE LAUNCH'}
{'error_workflow': None, 'error_task': None, 'idtag': '0102b69880c4e3b2', 'msg': 'DO_METHOD TRACE LAUNCH'}
{'error_workflow': '4d02b6988000f709', 'error_task': '4a02b698800aad04', 'idtag': '0102b6988000000b', 'msg': 'DM_WORKFLOW_E_PROCESS_AUTO_TASK'}
...
Now we create a class representing the data we want in the output:
from collections import defaultdict
class OutputData(object):
def __init__(self): # I'm using this class in a defaultdict, so the __init__ method can't take any arguments
self.idtag = None
self.idtag_count = 0
self.messages = defaultdict(int)
self.errors = []
self.workflows = []
def add(self, indata):
"Adds indata to this object."
self.idtag = indata.idtag
self.idtag_count += 1
self.messages[indata.msg] += 1
if indata.error_task:
self.errors.append(indata.error_task)
self.workflows.append(indata.error_workflow)
and feed out input data into it:
output_data = defaultdict(OutputData)
for indata in input_data:
output_data[indata.idtag].add(indata)
finally, we can output the output data in the desired format:
fmt = '%-20s %-6s %-55s %-15s %-60s %s'
print fmt % ('ID:', 'Count:', 'Message:', 'msg counts', 'taskid', 'workflowid')
for outdata in output_data.values():
print fmt % (
outdata.idtag,
outdata.idtag_count,
', '.join(outdata.messages.keys()),
', '.join(str(outdata.messages[k]) for k in outdata.messages.keys()),
', '.join(outdata.errors),
', '.join(outdata.workflows)
)
This type of structure, ie.: pre-process text, extract interesting input data, convert input data to output data, and finally serializing/formatting the output data; works well for all such problems, and it makes it easier to both debug and modify in the future.