
Counting requests and status codes per URI in a webserver log


Given a typical webserver log file that contains a mixture of absolute URLs, relative URLs, human requests and bots (some sample lines):

112.77.167.177 - - [01/Apr/2016:22:40:09 +1100] "GET /bad-credit-loans/abc/ HTTP/1.1" 200 7532 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
189.181.124.177 - - [31/Mar/2016:23:10:47 +1100] "GET /build/assets/css/styles-1a879e1b.css HTTP/1.1" 200 31654 "https://www.abc.com.au/customer-reviews/" "Mozilla/5.0 (iPhone; CPU iPhone OS 9_2_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13D15 Safari/601.1"
110.76.15.146 - - [01/Apr/2016:00:25:09 +1100] "GET http://www.abc.com.au/car-loans/low-doc-car-loans/ HTTP/1.1" 301 528 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

I'm looking to list every requested URI with its status code (200, 302, etc.) and a count of how many times each combination was requested.

Were it not for the varying IP addresses, timestamps, referring URLs, and user agents, I could combine sort and uniq in the standard fashion, as sketched below. Or, if I knew all the URLs in advance, I could simply loop over each URL/status-code combination with grep in its simplest form.
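
That standard fashion would be something like the following (a sketch; access.log is a placeholder file name), which fails here because the varying fields make every line unique:

# Count exact duplicate lines -- only works when the lines to be counted
# are identical, which the varying fields prevent here
sort access.log | uniq -c | sort -rn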

How do we disregard the varying items (user agents, timestamps, etc.) and extract just the URLs and the frequency of each URL/status-code combination?


Solution

  • You should just recognize that the interesting parts always sit at fixed field positions (with respect to space-separated fields).

    The URL is at position 7 and the status code is at position 9.
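
    You can verify the positions quickly by printing just those two fields (a sketch; access.log stands in for your log file):

    awk '{print $7, $9}' access.log

    For the three sample lines above this prints:

    /bad-credit-loans/abc/ 200
    /build/assets/css/styles-1a879e1b.css 200
    http://www.abc.com.au/car-loans/low-doc-car-loans/ 301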

    The rest is trivial. You can, for example, use:

    awk '{ sum[$7 " " $9]++; tot++ } END { for (i in sum) printf "%s %d\n", i, sum[i]; printf "TOTAL %d\n", tot }' LOGFILES
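
    Run against the three sample lines above, this prints the following (the order of a for (i in sum) loop is unspecified in awk, so the lines may come out in any order):

    /bad-credit-loans/abc/ 200 1
    /build/assets/css/styles-1a879e1b.css 200 1
    http://www.abc.com.au/car-loans/low-doc-car-loans/ 301 1
    TOTAL 3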
    

    Then pipe the result through sort if you need the output sorted.
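
    For example, to order by request count, highest first (a sketch; the TOTAL line is left out here so it does not sort in among the URLs):

    # Sort numerically, descending, on the third field (the count)
    awk '{ sum[$7 " " $9]++ } END { for (i in sum) printf "%s %d\n", i, sum[i] }' LOGFILES | sort -k3,3nr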