Tags: apache, awk, large-data-volumes

Processing Apache logs quickly


I'm currently running an awk script to process a large (8.1 GB) access-log file, and it's taking forever to finish. In 20 minutes it has written 14 MB of the (1000 ± 500) MB I expect it to write, and I wonder if I can process it much faster somehow.

Here is the awk script:

#!/bin/bash

awk '{t=$4" "$5; gsub(/[\[\]\/]/," ",t); sub(":"," ",t); printf("%s,",$1); system("date -d \""t"\" +%s")}' "$1"

EDIT:

For non-awkers: the script reads each line, extracts the date information, rewrites it into a format the `date` utility recognizes, and calls `date` to convert it to seconds since 1970 (the Unix epoch), finally printing it together with the IP as a line of a .csv file.

Example input: 189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"

Returned output: 189.5.56.113,1264136095
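For context on where the time goes: each input line triggers a fork/exec of date(1), essentially the call below (a minimal sketch using GNU date; the timestamp carries its own +0100 offset, so the result does not depend on the local timezone):

```shell
# One fork+exec of date(1) like this per log line is the bottleneck.
# GNU date resolves the +0100 offset itself, so the result is
# timezone-independent.
date -d "22 Jan 2010 05:54:55 +0100" +%s
# prints 1264136095
```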


Solution

  • @OP, your script is slow mainly because it spawns an external `date` process (via system()) for every single line of the file, and it's a big file as well (in the GB range). If you have gawk, use its built-in mktime() function to do the date-to-epoch-seconds conversion instead:

    awk 'BEGIN{
       # map month names to two-digit numbers: Jan -> 01 ... Dec -> 12
       m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
       for(o=1;o<=m;o++){
          date[d[o]]=sprintf("%02d",o)
        }
    }
    {
        # $4 is like "[22/Jan/2010:05:54:55" and $5 like "+0100]":
        # strip the brackets and turn every ":" in $4 into "/"
        gsub(/\[/,"",$4); gsub(":","/",$4); gsub(/\]/,"",$5)
        n=split($4, DATE,"/")   # day / month name / year / hour / min / sec
        day=DATE[1]
        mth=DATE[2]
        year=DATE[3]
        hr=DATE[4]
        min=DATE[5]
        sec=DATE[6]
        # gawk's built-in mktime() avoids forking date(1) once per line
        MKTIME= mktime(year" "date[mth]" "day" "hr" "min" "sec)
        print $1,MKTIME
    }' file
    

    output

    $ more file
    189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
    $ ./shell.sh    
    189.5.56.113 1264110895