regex, python-3.x, parsing, html-parsing

Parsing corrupt Apache logs using regex


I'm writing a Python 3.7.2 program to parse Apache logs, looking for all successful response codes. My current regex parses every well-formed Apache log entry into a tuple of [origin] [date/time] [HTTP method/file/protocol] [response code] [file size], and then I just check whether the response code is 3xx. The problem is that some entries are corrupt. The ones that are corrupt enough to be unreadable I've already stripped out in a different part of the program, but several are only missing the closing " (quotation mark) on the method/file/protocol item, and each of those lines throws an error when I parse it. I think I need a regex "or" expression that accepts either " or whitespace at that point, but my attempt splits the quote off into a separate tuple item instead of matching either "GET 613.html HTTP/1.0" or "GET 613.html HTTP/1.0 (no closing quote). I'm new to regex and thoroughly stumped. Can anyone explain what I'm doing wrong?

I should note that the logs have been scrubbed of some info, instead of origin IP it only shows 'local' or 'remote' and the OS/browser info is removed entirely.

This is the part of the regex for the relevant tuple item, which works with valid entries: "(.*)?". I've also tried:

"(.*)?("|\s) which creates another tuple item and still throws the error.

Here's a snippet of the log entries, including the last entry, which is missing its closing ":

local - - [27/Oct/1994:18:47:03 -0600] "GET index.html HTTP/1.0" 200 3185
local - - [27/Oct/1994:18:48:53 -0600] "GET index.html HTTP/1.0" 404 -
local - - [27/Oct/1994:18:49:55 -0600] "GET index.html HTTP/1.0" 303 3185
local - - [27/Oct/1994:18:50:25 -0600] "GET 612.html HTTP/1.0" 404 -
local - - [27/Oct/1994:18:50:41 -0600] "GET index.html HTTP/1.0" 200 388
local - - [27/Oct/1994:18:50:52 -0600] "GET 613.html HTTP/1.0 303 728

import re

regex = r'([(\w+)]+) - - \[(.*?)\] "(.*)?" (\d+) (\S+)'

with open("validlogs.txt") as validlogs:
    i = 0
    array = []
    successcodes = 0
    for line in validlogs:
        array.append(line)
        loglength = len(array)

    while (i < loglength):
        line = re.match(regex, array[i]).groups()
        if(line[3].startswith("3")):
            successcodes+=1
        i+=1
    print("Number of successcodes: ", successcodes)

Parsing the log entries above should give:

Number of success codes: 2

Instead I get:

Traceback (most recent call last):
  File "test.py", line 24, in <module>
    line = re.match(regex, array[i]).groups()
AttributeError: 'NoneType' object has no attribute 'groups'

because (I believe) regex is looking explicitly for a " and can't handle the line entry that's missing it.
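That diagnosis can be checked with a small sketch. re.match returns None when a line does not fit the pattern, so calling .groups() on the result raises the AttributeError shown above. The version below is only one possible approach, not a definitive fix: it assumes the same field layout as the regex in the question, simplifies the origin group to \w+, makes the closing quote optional with "?, and anchors the match to the end of the line. The names LOG_PATTERN, parse_line and count_3xx are made up for illustration.

```python
import re

# Assumption: same layout as the regex above, with the origin simplified to \w+,
# the closing quote made optional ("?), and the match anchored to the line end.
LOG_PATTERN = re.compile(r'(\w+) - - \[(.*?)\] "(.*?)"? (\d{3}) (\S+)\s*$')


def parse_line(line):
    """Return the five captured fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groups() if match else None


def count_3xx(lines):
    """Count parsed lines whose response code starts with 3; skip unparsable ones."""
    count = 0
    for line in lines:
        fields = parse_line(line)
        if fields is not None and fields[3].startswith("3"):
            count += 1
    return count


# The six sample entries from the question; the last one has no closing quote.
sample = [
    'local - - [27/Oct/1994:18:47:03 -0600] "GET index.html HTTP/1.0" 200 3185',
    'local - - [27/Oct/1994:18:48:53 -0600] "GET index.html HTTP/1.0" 404 -',
    'local - - [27/Oct/1994:18:49:55 -0600] "GET index.html HTTP/1.0" 303 3185',
    'local - - [27/Oct/1994:18:50:25 -0600] "GET 612.html HTTP/1.0" 404 -',
    'local - - [27/Oct/1994:18:50:41 -0600] "GET index.html HTTP/1.0" 200 388',
    'local - - [27/Oct/1994:18:50:52 -0600] "GET 613.html HTTP/1.0 303 728',
]

print("Number of 3xx codes:", count_3xx(sample))
```

Checking the result of re.match for None before calling .groups() means a line that still fails to match is skipped instead of crashing the loop. The end-of-line anchor is there so the lazy request group cannot stop early on a file name that happens to start with three digits, such as 613.html.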


Solution

  • I originally used re.match with ([(\w+)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) inside a try/except block that ran continue on failure, so only the log lines that actually matched the pattern were parsed. Since ~100,000 of the ~750,000 lines didn't conform to the correct Apache log pattern, I wound up changing my code to use re.search with much smaller patterns instead.

    For instance:

    import re

    redirectCounter = 0
    with open("./http_access_log.txt") as logs:
        for line in logs:
            if re.search(r'\s*(30\d)\s\S+', line):  # Checking for 30x redirect codes
                redirectCounter += 1
    

    I've read that re.match is faster than re.search, but I felt that accurately capturing as many log entries as possible (this handles all but about 2,000 lines, most of which have no usable info) was more important.
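One caveat with a short unanchored pattern is that it can match a 30x-looking token anywhere in the line, for example inside the request itself. A minimal sketch of a stricter variant follows, assuming the status code and size are always the last two fields on the line. The pattern REDIRECT_AT_END and the helper count_redirects are hypothetical names, not part of the original answer.

```python
import re

# Hypothetical pattern: whitespace, a 30x status, then the size field
# (digits or "-"), anchored to the end of the line.
REDIRECT_AT_END = re.compile(r'\s(30\d)\s+(\d+|-)\s*$')


def count_redirects(lines):
    """Count lines whose second-to-last field is a 30x status code."""
    return sum(1 for line in lines if REDIRECT_AT_END.search(line))


lines = [
    'local - - [27/Oct/1994:18:47:03 -0600] "GET index.html HTTP/1.0" 200 3185',
    'local - - [27/Oct/1994:18:49:55 -0600] "GET index.html HTTP/1.0" 303 3185',
    'local - - [27/Oct/1994:18:50:52 -0600] "GET 613.html HTTP/1.0 303 728',
    # A made-up line where "301" appears in the request, not as the status.
    'local - - [27/Oct/1994:18:51:00 -0600] "GET 301 index.html HTTP/1.0" 200 55',
]

print("Redirects:", count_redirects(lines))
```

The last line is an invented example: the shorter pattern \s*(30\d)\s\S+ matches it because of the "301 index.html" fragment, while the anchored version only looks at the final two fields. Whether this matters depends on how often such requests appear in the real log.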