performance, apache, logging

How to analyse Apache logs faster on one server with 4 cores


We have a lot of Apache logs each week, almost 420 GB/week, and only one server to analyse them. A log line looks like this:

192.168.1.1 - - - [11/Jul/2011:23:59:59 +0800] "GET /test.html HTTP/1.1" 200 48316 31593 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; .NET CLR 2.0.7.1030)" - - "a=-; b=-; c=-" -

My task is to get all the 2xx responses and compute the average TPS for every 30 minutes. My solution is:

gzcat %s | awk '{print $5, $10}' | grep -E ' 2[0-9][0-9]$' | awk -F "[" '{print $2}' | awk '{print $1}' | sort -n -k 1 | uniq -c

Then it is easy to get the result with some further calculation.
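
For illustration, that remaining calculation could be one more awk stage appended after uniq -c. This is only a sketch, assuming the uniq -c output has the form "COUNT DD/Mon/YYYY:HH:MM:SS": it groups the per-second counts into 30-minute buckets and divides by the 1800 seconds in each bucket.

awk '{
    split($2, t, ":")                             # t[1]=date, t[2]=hour, t[3]=minute
    bucket = t[1] ":" t[2] ":" (t[3] < 30 ? "00" : "30")
    hits[bucket] += $1                            # $1 is the count printed by uniq -c
} END {
    for (b in hits)
        printf "%s  avg tps: %.2f\n", b, hits[b] / 1800
}'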

I tested the pipeline and it handles about 100 MB in 20 seconds, which is just 5 MB/s, so 420 GB would take nearly a day. How can I make it faster, given that the server has 4 cores and 8 GB of memory? Is there a better solution?


Solution

  • The output of the first awk command is something like this:

    [11/Jul/2011:23:59:59 200
    

    With this format you can simplify the grep command a lot, using for example:

    fgrep ' 2'
    

    That is, you grep for a space (there is only one, added by awk as the output field separator) followed by the first digit of the result code. By using fgrep instead of grep, you tell grep that you are not querying with a regular expression but searching for a fixed string, and this makes it a lot faster.

    Also, you can gain some more speed by combining the last two awk commands. From:

    awk -F "[" '{print $2}' | awk '{print $1}'
    

    To:

    awk -F '[[ ]' '{print $2}' 
    

    This script also uses both of the cores on my PC, though the second one is not used at 100%. If you want to use all your cores, you will have to split the data to be parsed into four parts, process them in parallel and then combine the results, for example as sketched below.
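
    A rough sketch of that last step, assuming the week's logs are already split across several .gz files (the access-*.gz and *.counts names are made up here, and -P needs GNU xargs): each file is run through the simplified pipeline above, four files at a time, and the per-second counts are merged afterwards.

    # Run one filtering pipeline per file, at most four in parallel;
    # the \$ escapes keep the awk field references away from the outer shell.
    ls access-*.gz | xargs -P 4 -I{} sh -c \
        "gzcat {} | awk '{print \$5, \$10}' | fgrep ' 2' | awk -F '[[ ]' '{print \$2}' | sort | uniq -c > {}.counts"

    # The same second can appear in more than one file, so sum the partial
    # counts before doing the 30-minute average.
    awk '{total[$2] += $1} END {for (s in total) print total[s], s}' *.counts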