java awk groovy text-parsing processing-efficiency

Request log parser - Text parsing


I have to parse a request log that has the following structure:

07/Dec/2017:18:15:58 +0100 [293920] -> GET URL HTTP/1.1
07/Dec/2017:18:15:58 +0100 [293920] <- 200 text/html 5ms
07/Dec/2017:18:15:58 +0100 [293921] -> GET URL HTTP/1.1
07/Dec/2017:18:15:58 +0100 [293921] <- 200 image/png 39ms
07/Dec/2017:18:15:59 +0100 [293922] -> HEAD URL HTTP/1.0
07/Dec/2017:18:15:59 +0100 [293922] <- 401 - 1ms
07/Dec/2017:18:15:59 +0100 [293923] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293923] <- 200 text/html 178ms
07/Dec/2017:18:15:59 +0100 [293924] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293924] <- 200 text/html 11ms
07/Dec/2017:18:15:59 +0100 [293925] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293925] <- 200 text/html 7ms
07/Dec/2017:18:15:59 +0100 [293926] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293926] <- 200 text/html 16ms
07/Dec/2017:18:15:59 +0100 [293927] -> GET URL HTTP/1.1
07/Dec/2017:18:15:59 +0100 [293927] <- 200 text/html 8ms

The output should link the two lines in this log that share the same number between square brackets, so that the data can be processed further with other software packages. I want to extract the useful information into a CSV file with the following structure:

startTimestamp,endTimestamp,requestType,responseCode,URL,typ,responsetime

07/Dec/2017:18:15:58,07/Dec/2017:18:15:58,GET,200,URL,text/html,5ms

I have made a Groovy script that does the trick, but it is terribly slow.

I know I can make some improvements, but I would like your ideas. Some of you have probably tackled this problem in the past.

The response does not always directly follow its request, and not every request gets a response (or the response is not logged because of a server restart).

The log files range from 70 MB to 300 MB, and my Groovy script takes a ridiculously long time.

I know there are good, fast solutions in the Unix terminal with awk and sort, but I have no experience with them.

Thanks in advance for your help.

Here is the code I already have, preceded by the improvements I can think of:

1) Use a map keyed by the request number, for faster lookup and less parsing

2) Don't scan the whole backlog list for every line

def logFile = new File("../request.log")
def outputfile = new File(logFile.parent, logFile.name + ".csv")
def backlog = new ArrayList<String>()
StringBuilder output = new StringBuilder()


outputfile.withPrintWriter { writer ->
    logFile.withReader { Reader reader ->
        reader.eachLine { String line ->
            Iterator<String> it = backlog.iterator()
            while (it.hasNext()) {
                String bLine = it.next()
                String[] lineSplit = line.split(" ")
                if (bLine.contains(lineSplit[2])) {
                    String[] bLineSplit = bLine.split(" ")
                    output.append(bLineSplit[0] + "," + lineSplit[0] + "," + bLineSplit[4] + "," + lineSplit[4] + "," + bLineSplit[5] + "," + lineSplit[5] + "," + lineSplit[6] + "\r\n")
                    //writer.println(outputline)
                    it.remove()
                }
            }
            backlog.add(line)
        }
    }
    // Write all matched pairs first, then every line that never found a partner
    writer.println(output)
    backlog.each { String line ->
        writer.println(line)
    }
}

Solution

  • As one-liner:

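    sort -s -k 3,3 request.log | awk 'BEGIN { print "startTimestamp;endTimestamp;requestType;responseCode;URL;typ;responsetime"; split("", request); split("", response) } $4 == "->" { printLine(); split($0, request); split("", response) } $4 == "<-" && (3 in request) && $3 == request[3] { split($0, response) } END { printLine() } function printLine() { if (length(request)) { print request[1] ";" response[1] ";" request[5] ";" response[5] ";" request[6] ";" response[6] ";" response[7] } }'
    

    As multi-liner:

    sort -s -k 3,3 request.log | awk '
        BEGIN {
            print "startTimestamp;endTimestamp;requestType;responseCode;URL;typ;responsetime"
            split("", request)
            split("", response)
        }
        # Request line: flush the previous pair and start a new one
        $4 == "->" {
            printLine()
            split($0, request)
            split("", response)
        }
        # Response line: accept it only when its id matches the pending request
        $4 == "<-" && (3 in request) && $3 == request[3] {
            split($0, response)
        }
        END {
            printLine()
        }
        # Print the pending request; response fields stay empty if none arrived
        function printLine() {
            if (length(request)) {
                print request[1] ";" response[1] ";" request[5] ";" response[5] ";" request[6] ";" response[6] ";" response[7]
            }
        }'

    The -s flag keeps the sort stable, so each request stays ahead of its response, and the id comparison keeps a response whose request was never logged from being attached to the preceding request.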
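
The sort step can probably be avoided altogether by keeping the pending requests in awk associative arrays keyed by the bracketed number, which is the map idea from the question. The following is only a minimal sketch under some assumptions: URLs contain no spaces or commas, a request is always logged before its response, and responses without a logged request can be dropped. The function name pair_requests is made up for illustration, and the output uses commas as in the CSV layout the question asks for.

```shell
# pair_requests is a hypothetical helper name; it reads the log from the
# files given as arguments, or from standard input when there are none.
pair_requests() {
    awk '
        BEGIN {
            OFS = ","
            print "startTimestamp,endTimestamp,requestType,responseCode,URL,typ,responsetime"
        }
        # Request: remember its fields under the bracketed id.
        $4 == "->" {
            start[$3] = $1; method[$3] = $5; url[$3] = $6
            next
        }
        # Response: emit one CSV row if the matching request is still pending.
        # Assumption: a response without a pending request is silently skipped.
        $4 == "<-" && ($3 in start) {
            print start[$3], $1, method[$3], $5, url[$3], $6, $7
            delete start[$3]; delete method[$3]; delete url[$3]
        }
        # Requests that never got a response: leave the response columns empty.
        END {
            for (id in start)
                print start[id], "", method[id], "", url[id], "", ""
        }
    ' "$@"
}
```

It could be called as pair_requests request.log > request.log.csv. Rows come out in the order in which the responses were logged, and only the requests still waiting for a response are held in memory. The order of the unanswered requests printed at the end is not defined by awk.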