Given a typical webserver log file containing a mixture of absolute URLs, relative URLs, human requests, and bot requests (some sample lines):
112.77.167.177 - - [01/Apr/2016:22:40:09 +1100] "GET /bad-credit-loans/abc/ HTTP/1.1" 200 7532 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
189.181.124.177 - - [31/Mar/2016:23:10:47 +1100] "GET /build/assets/css/styles-1a879e1b.css HTTP/1.1" 200 31654 "https://www.abc.com.au/customer-reviews/" "Mozilla/5.0 (iPhone; CPU iPhone OS 9_2_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13D15 Safari/601.1"
110.76.15.146 - - [01/Apr/2016:00:25:09 +1100] "GET http://www.abc.com.au/car-loans/low-doc-car-loans/ HTTP/1.1" 301 528 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
I'm looking to list all the URIs requested, together with the status code (200, 302, etc.) and the total count of requests, i.e.
http://www.abc.com.au 301 3,900
/bad-credit-loans/abc/ 200 123
/bad-credit-loans/abc/ 302 7
Were it not for the presence of the varying IP addresses, timestamps, referring URLs, and user agents, I would be able to combine uniq and sort in the standard fashion. Or, if I knew all the URLs in advance, I could simply loop over each URL/status-code combination with grep in its simplest form.
How can I disregard the varying fields (user agents, timestamps, etc.) and extract just the URLs and the frequency of each status code?
You just need to recognize that the interesting parts are always at constant field positions (with respect to space-separated fields): the URL is in field 7 and the status code is in field 9.
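You can check those positions against your own log by printing just the two fields first (a quick sketch; access.log is an assumed file name):
awk '{ print $7, $9 }' access.log | head -n 3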
The rest is straightforward. You may, for example, use:
awk '{sum[$7 " " $9]++;tot++;} END { for (i in sum) { printf "%s %d\n", i, sum[i];} printf "TOTAL %d\n", tot;}' LOGFILES
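The same program, spread over several lines with comments (equivalent to the one-liner above):
awk '
    # count each URL / status-code pair, and keep a grand total
    { sum[$7 " " $9]++; tot++ }
    END {
        # one line per pair: "<url> <status> <count>"
        for (i in sum)
            printf "%s %d\n", i, sum[i]
        printf "TOTAL %d\n", tot
    }
' LOGFILES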
Then pipe the result through sort if you need the output sorted.
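For example, to order by request count, descending (a sketch; the TOTAL line is omitted here so it does not get mixed into the sorted output):
awk '{ sum[$7 " " $9]++ } END { for (i in sum) printf "%s %d\n", i, sum[i] }' LOGFILES | sort -k3,3nr
Alternatively, the uniq/sort approach from the question works once the two fields are projected out first, though the count then appears in the first column:
awk '{ print $7, $9 }' access.log | sort | uniq -c | sort -rn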