So I have a lot of log .txt files that look somewhat like this:
2021-04-01T12:54:38.156Z START RequestId: 123 Version: $LATEST
2021-04-01T12:54:42.356Z END RequestId: 123
2021-04-01T12:54:42.356Z REPORT RequestId: 123 Duration: 4194.14 ms Billed Duration: 4195 ms Memory Size: 2048 MB Max Memory Used: 608 MB
I need to create a pandas dataframe from this data with the following features, where each row would represent one log entry:
DateTime, Keyword(start/end), RequestId, Duration, BilledDuration, MemorySize, MaxMemoryUsed
The problem is that each file has a different length and there are different types of logs, so not every line looks the same, but there are patterns. I've never used RegEx, but I think this is what I have to use. So is there a way to transform these strings into a dataset?
(my goal is to perform memory usage anomaly detection)
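For anyone wondering what a regex approach could look like, here is a minimal sketch with named groups, assuming every REPORT line follows the format shown above (the pattern itself is my own guess at the structure, not anything from a library):

import re

# Sketch of a named-group pattern for the REPORT line format above.
# 'Init Duration' is optional because it only appears on Lambda cold starts.
REPORT_PATTERN = re.compile(
    r'(?P<timestamp>\S+)\s+REPORT\s+RequestId:\s+(?P<request_id>\S+)\s+'
    r'Duration:\s+(?P<duration>[\d.]+)\s+ms\s+'
    r'Billed Duration:\s+(?P<billed_duration>[\d.]+)\s+ms\s+'
    r'Memory Size:\s+(?P<memory_size>\d+)\s+MB\s+'
    r'Max Memory Used:\s+(?P<max_memory_used>\d+)\s+MB'
    r'(?:\s+Init Duration:\s+(?P<init_duration>[\d.]+)\s+ms)?'
)

line = ('2021-04-01T12:54:42.356Z REPORT RequestId: 123 Duration: 4194.14 ms '
        'Billed Duration: 4195 ms Memory Size: 2048 MB Max Memory Used: 608 MB')
match = REPORT_PATTERN.search(line)
if match:
    # groupdict() returns one dict per line, ready to collect into a dataframe
    print(match.groupdict())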
So apparently I'm still bad at asking the right questions on this website, but thankfully a bit better at finding solutions by myself. So if somebody else has the same problem, this is what I did:
import re
import gzip
import pandas as pd

# file_list holds the paths to the gzipped log files
df = pd.DataFrame(columns=['timestamp', 'request_id', 'billed_duration',
                           'max_memory_used', 'init_duration'])
counter = 0
for file in file_list:
    # open and read
    with gzip.open(file, 'rb') as f:
        file_content = f.read().decode('utf-8')
    # split file into lines
    splitted_file_content = file_content.splitlines()
    for line in splitted_file_content:
        # look for the report lines
        if re.search('REPORT', line):
            tokens = line.split()
            timestamp = tokens[0]
            request_id = tokens[3]
            billed_duration = tokens[9]
            max_memory_used = tokens[18]
            # 'Init Duration' only appears on cold starts, so guard the index
            init_duration = tokens[22] if len(tokens) > 22 else None
            # pack it into the dataframe
            df.loc[counter] = [timestamp, request_id, billed_duration,
                               max_memory_used, init_duration]
            counter += 1
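Since the extracted columns are strings, they need to be converted to numbers before doing any anomaly detection. A minimal sketch of a simple z-score check on max memory used (the 3-sigma threshold is just an assumption, pick whatever fits your data):

import pandas as pd

# df is the dataframe built in the loop above
df['max_memory_used'] = pd.to_numeric(df['max_memory_used'])

# flag rows more than 3 standard deviations from the mean (assumed threshold)
mean = df['max_memory_used'].mean()
std = df['max_memory_used'].std()
anomalies = df[(df['max_memory_used'] - mean).abs() > 3 * std]
print(anomalies)

One more note: appending row by row with df.loc gets slow on large logs; collecting the values into a list of dicts and calling pd.DataFrame(records) once at the end is noticeably faster.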