Tags: python, regex, pid, logfile-analysis

Filtering Log File with RegEx


Hi, I can't work out how to extract the date and PID from a log file. I'm trying to display the date followed by the PID, as shown below, but my code returns only the date and not the PID.

Please see my code:

import re

def show_time_of_pid(line):

  pattern = r"^([\w+]*[\s\d\:]+.[\[(\d+)\]])"
  result = re.search(pattern, line)

  return result

print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)")) # Jul 6 14:01:23 pid:29440
<re.Match object; span=(0, 14), match='Jul 6 14:01:23'>

I was expecting Jul 6 14:01:23 pid:29440

Instead I get <re.Match object; span=(0, 14), match='Jul 6 14:01:23'>, with no PID displayed.


Solution

  • I would probably write things like this:

    import re

    def show_time_of_pid(line):
    
        pattern = r"^(\w{3}) \s (\d+) \s ([\d:]+) \s .[^[]+\[(\d+)]:.*"
        result = re.search(pattern, line, flags=re.VERBOSE)
    
        return result.groups()
    
    print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))
    

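    With re.VERBOSE, unescaped whitespace in the pattern is ignored, so we can space things out to make them a little easier to read; whitespace in the input has to be matched explicitly with \s. Here we have several distinct match groups: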

    • (\w{3}) matches the month name
    • (\d+) matches the day of the month
    • ([\d:]+) matches the time
    • [^[]+\[(\d+)] skips ahead to the PID and captures it ("a bunch of characters that are not [", followed by [, then a string of digits, then ])

    Each group is separated by whitespace (\s).

    Running the above code produces:

    ('Jul', '6', '14:01:23', '29440')
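
As a sketch of how far re.VERBOSE can be taken, the same pattern can be spread over several lines with a comment on each piece. The name LOG_PATTERN is only an illustrative choice, and precompiling with re.compile is optional:

```python
import re

# Same pattern as above, spread over several lines. With re.VERBOSE,
# unescaped whitespace and "#" comments inside the pattern are ignored.
LOG_PATTERN = re.compile(
    r"""
    ^(\w{3})      # month name, e.g. "Jul"
    \s (\d+)      # day of the month
    \s ([\d:]+)   # time, e.g. "14:01:23"
    \s .[^[]+     # host and process name, up to the opening "["
    \[(\d+)]      # PID between square brackets
    :.*           # rest of the line
    """,
    re.VERBOSE,
)

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"
print(LOG_PATTERN.search(line).groups())
# ('Jul', '6', '14:01:23', '29440')
```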
    

    You could get fancier with an outer capture group by writing:

    import re
    
    def show_time_of_pid(line):
    
        pattern = r"^((\w{3}) \s (\d+) \s ([\d:]+)) \s .[^[]+\[(\d+)]:.*"
        result = re.search(pattern, line, flags=re.VERBOSE)
    
        return result.groups()
    
    print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))
    

    We get the entire date string in the first capture group:

    ('Jul 6 14:01:23', 'Jul', '6', '14:01:23', '29440')
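
Since the question asked for the string Jul 6 14:01:23 pid:29440, one possible way to build it is to combine the outer timestamp group with the PID group. This is a minimal sketch; returning an empty string for lines that do not match is an assumption, not something the question specifies:

```python
import re

def show_time_of_pid(line):
    # Group 1 is the whole timestamp (outer group); group 5 is the PID.
    pattern = r"^((\w{3}) \s (\d+) \s ([\d:]+)) \s .[^[]+\[(\d+)]:.*"
    result = re.search(pattern, line, flags=re.VERBOSE)
    if result is None:
        # Assumption: return an empty string for lines that do not match.
        return ""
    return f"{result.group(1)} pid:{result.group(5)}"

print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))
# Jul 6 14:01:23 pid:29440
```

Without the None check, a line that does not match would raise AttributeError, because re.search returns None in that case.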
    

    And of course we can get back a labeled dictionary instead of just a tuple by using named capture groups:

    import re
    
    def show_time_of_pid(line):
    
        pattern = r"^(?P<timestamp>(?P<month>\w{3}) \s (?P<day>\d+) \s ([\d:]+)) \s .[^[]+\[(?P<pid>\d+)]:.*"
        result = re.search(pattern, line, flags=re.VERBOSE)
    
        return result.groupdict()
    
    print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"))
    

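    Which produces the following (the time group is left unnamed in this pattern, so it does not appear in the dictionary):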

    {'timestamp': 'Jul 6 14:01:23', 'month': 'Jul', 'day': '6', 'pid': '29440'}
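
For a whole log file, the named groups can be applied line by line while skipping anything that does not match. The sketch below is hedged: parse_log_lines and LOG_PATTERN are hypothetical names, the time group is given the added name "time", and the second sample line is made up purely to show a non-match:

```python
import re

# Named groups as above, with the time group also given a name
# ("time" is an added label, not part of the original pattern).
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>(?P<month>\w{3}) \s (?P<day>\d+) \s (?P<time>[\d:]+))"
    r" \s .[^[]+\[(?P<pid>\d+)]:.*",
    re.VERBOSE,
)

def parse_log_lines(lines):
    """Yield one dict per matching line; skip lines that do not match."""
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is not None:
            yield match.groupdict()

sample = [
    "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)",
    "this line is not in syslog format",  # made-up non-matching line
]
for entry in parse_log_lines(sample):
    print(f"{entry['timestamp']} pid:{entry['pid']}")
# Jul 6 14:01:23 pid:29440
```

Because parse_log_lines accepts any iterable of strings, an open file object could be passed in place of the sample list.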