
Parse a custom log file in Python


I have a log file in which some entries span several lines, i.e. the message itself contains newline characters.

Sample File:

2019-02-12T00:01:03.428+01:00 [Error] ErrorCode {My error: "A"} -  -  - 00000000-0000-0000-6936-008007000000 
2019-02-12T00:01:03.428+01:00 [Error] ErrorCode {My error: "A"} -  -  - 00000000-0000-0000-6936-008007000000 
2019-02-12T00:03:23.944+01:00 [Information] A validation warning occurred: [[]] while running a file,
--- End of stack trace ---
    FileNotFoundError
--- End of stack trace from previous location where exception was thrown ---
    System Error

I want to split the data into three columns: Timestamp, type_code (which shows whether the event is an error, a warning or information), and the message.

I used the split function for this:

# maxsplit=1 keeps brackets inside the message (e.g. "[[]] ") from truncating the text
currentDict = {"date": line.split(" [", 1)[0],
               "type": line.split(" [", 1)[1].split("] ", 1)[0],
               "text": line.split("] ", 1)[1]}

This splits the data into the given columns correctly, but it raises an error for an entry like the one below:

2019-02-12T00:03:23.944+01:00 [Information] A validation warning occurred: [[]] while running a file,
--- End of stack trace ---
    FileNotFoundError
--- End of stack trace from previous location where exception was thrown ---
    System Error
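The failure can be reproduced in isolation. A continuation line such as `--- End of stack trace ---` contains neither " [" nor "] ", so `split` returns a one-element list and indexing it with `[1]` raises `IndexError`. A minimal sketch (the variable names here are only illustrative):

```python
# A continuation line from the sample log; it has no "[type]" field.
line = "--- End of stack trace ---"

# No " [" separator is present, so split returns the whole line as one element.
parts = line.split(" [")

try:
    parts[1]                    # there is no second element
    error = None
except IndexError as exc:
    error = exc                 # this is the error seen on multi-line entries
```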

My second approach uses a regex:

import re

with open(name, "r") as f:
    for line in f:
        # Raw string; the dot before the milliseconds is escaped so it matches a literal "."
        data_matcher = re.findall(r"^\d{4}-?\d{1,2}-?\d{1,2}T\d{1,2}:\d{1,2}:\d{1,2}\.\d{1,3}\+?\d{1,2}:\d{1,2}", line)

Expected Output

Using this, I am only able to extract the timestamp, and I am stuck on how to extract the next two fields.


Solution

  • You don't need to be that precise with your regex:

    import re
    
    log_pattern = re.compile(r"([0-9\-]*)T([0-9\-:.+]*)\s*\[([^]]*)\](.*)")
    
    with open(name, "r") as f:
        for line in f:
            match = log_pattern.match(line)
            if not match:
                # Lines without a timestamp (e.g. stack-trace lines) are skipped
                continue
            grps = match.groups()
            print("Log line:")
            print(f"  date:{grps[0]},\n  time:{grps[1]},\n  type:{grps[2]},\n  text:{grps[3]}")
    

    You could even be less precise than that; for example, r"(.*)T([^\s]*)\s*\[([^]]*)\](.*)" also works on this sample. regex101 is a handy tool for testing regular expressions.
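The loop above skips lines that do not start with a timestamp, so the stack-trace lines of a multi-line entry are dropped. One possible way to keep them is to append every non-matching line to the message of the previous entry. The following is only a sketch under that assumption: `ENTRY_START`, `parse_log` and the dictionary keys are hypothetical names chosen for illustration, and the pattern keeps date and time together as a single timestamp.

```python
import re

# A line starts a new entry if it begins with a date, "T", a time, then "[type]".
ENTRY_START = re.compile(r"^(\d{4}-\d{2}-\d{2}T\S+)\s+\[([^\]]*)\]\s?(.*)$")


def parse_log(lines):
    """Group raw lines into entries with timestamp, type_code and message."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY_START.match(line)
        if match:
            timestamp, type_code, message = match.groups()
            entries.append(
                {"timestamp": timestamp, "type_code": type_code, "message": message}
            )
        elif entries and line.strip():
            # Continuation line: attach it to the message of the previous entry.
            entries[-1]["message"] += "\n" + line
    return entries


# Lines taken from the sample file in the question.
sample = [
    '2019-02-12T00:01:03.428+01:00 [Error] ErrorCode {My error: "A"} -  -  - 00000000-0000-0000-6936-008007000000 \n',
    "2019-02-12T00:03:23.944+01:00 [Information] A validation warning occurred: [[]] while running a file,\n",
    "--- End of stack trace ---\n",
    "    FileNotFoundError\n",
]
entries = parse_log(sample)
```

Because a file object yields lines when iterated, the same function could be called as `parse_log(f)` inside the `with open(name, "r") as f:` block.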