Web server log analyzers (e.g. Urchin) often display a number of "sessions". A session is defined as a series of page visits / clicks made by an individual within a limited, continuous time segment. Analyzers try to identify these segments using IP addresses, often with supplementary info such as user agent and OS, along with a session timeout threshold such as 15 or 30 minutes.
For certain web sites and applications, a user can be logged in and/or tracked with a cookie, which means the server can precisely know when a session begins. I'm not talking about that, but about inferring sessions heuristically ("session reconstruction") when the web server does not track them.
I could write some code e.g. in Python to try to reconstruct sessions based on the criteria mentioned above, but I'd rather not reinvent the wheel. I'm looking at log files of a size around 400K lines, so I'd have to be careful to use a scalable algorithm.
My goal here is to extract a list of unique IP addresses from a log file, and for each IP address, to have the number of sessions inferred from that log. Absolute precision and accuracy are not necessary... pretty-good estimates are ok.
Based on this description:
a new request is put in an existing session if two conditions are valid:
- the IP address and the user-agent are the same of the requests already inserted in the session,
- the request is done less than fifteen minutes after the last request inserted.
it would be simple in theory to write a Python program to build up a dictionary (keyed by IP) of dictionaries (keyed by user-agent) whose value is a pair: (number of sessions, latest request of latest session).
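A minimal sketch of that bookkeeping, assuming the records arrive in chronological order as (ip, user_agent, timestamp) tuples. The name count_sessions and the tuple layout are made up for illustration, it keys a flat dictionary on the (IP, user-agent) pair instead of nesting dictionaries, and it is written for Python 3:

```python
from collections import defaultdict
from datetime import datetime, timedelta


def count_sessions(records, timeout=timedelta(minutes=15)):
    """Return {ip: inferred session count} for chronologically ordered records.

    records: iterable of (ip, user_agent, timestamp) tuples (hypothetical layout).
    """
    last_seen = {}                      # (ip, user_agent) -> time of latest request
    sessions_per_ip = defaultdict(int)  # ip -> number of inferred sessions
    for ip, user_agent, when in records:
        key = (ip, user_agent)
        previous = last_seen.get(key)
        # A new session starts on the first request for this key, or when the
        # gap since the previous request reaches the timeout.
        if previous is None or when - previous >= timeout:
            sessions_per_ip[ip] += 1
        last_seen[key] = when
    return dict(sessions_per_ip)


# Made-up sample data: one IP, two user agents.
t0 = datetime(2010, 9, 21, 23, 0, 0)
sample = [
    ("128.123.114.141", "Firefox", t0),
    ("128.123.114.141", "Firefox", t0 + timedelta(minutes=5)),   # same session
    ("128.123.114.141", "Firefox", t0 + timedelta(minutes=40)),  # 35-minute gap
    ("128.123.114.141", "MSIE", t0 + timedelta(minutes=41)),     # new user agent
]
print(count_sessions(sample))
```

Both dictionaries grow with the number of distinct (IP, user-agent) pairs rather than with the number of log lines, so memory should stay modest for a log of this size.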
But I would rather try to use an existing implementation if one's available, since I might otherwise risk spending a lot of time tuning performance.
FYI lest someone ask for sample input, here is a line of our log file (sanitized):
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status
2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.mysite.org/blarg.htm 200 0 0
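For reference, one way to pull the relevant fields out of such a line is to pair the names from the #Fields header with the whitespace-separated values. This is only a sketch: parse_line is a made-up helper, and it relies on the log encoding spaces inside values as "+", as in the sample above (Python 3):

```python
FIELDS_HEADER = ("#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query "
                 "s-port cs-username c-ip cs(User-Agent) cs(Referer) "
                 "sc-status sc-substatus sc-win32-status")
SAMPLE_LINE = ("2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - "
               "128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;"
               "+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) "
               "http://www.mysite.org/blarg.htm 200 0 0")


def parse_line(line, field_names):
    """Map field names to values; return None if the column count is off."""
    values = line.split()
    if len(values) != len(field_names):
        return None
    return dict(zip(field_names, values))


field_names = FIELDS_HEADER.split()[1:]   # drop the "#Fields:" token
record = parse_line(SAMPLE_LINE, field_names)
print(record["c-ip"], record["date"], record["time"])
```

Only c-ip, cs(User-Agent), date and time matter for session reconstruction; the rest can be ignored.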
OK, in the absence of any other answer, here's my Python implementation. I'm not a Python expert. Suggestions for improvement are welcome.
#!/usr/bin/env python
"""Reconstruct sessions: Take a space-delimited web server access log
including IP addresses, timestamps, and User Agent,
and output a list of the IPs, and the number of inferred sessions for each."""
## Input looks like:
# Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status
# 2010-09-21 23:59:59 172.21.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.site.org//baz.htm 200 0 0
import datetime
import operator
infileName = "ex100922.log"
outfileName = "visitor-ips.csv"
ipDict = {}
def inputRecords():
    infile = open(infileName, "r")
    recordsRead = 0
    progressThreshold = 100
    sessionTimeout = datetime.timedelta(minutes=30)
    for line in infile:
        if (line[0] == '#'):
            continue
        else:
            recordsRead += 1
            fields = line.split()
            # print "line of %d records: %s\n" % (len(fields), line)
            if (recordsRead >= progressThreshold):
                print "Read %d records" % recordsRead
                progressThreshold *= 2
            # http://www.dblab.ntua.gr/persdl2007/papers/72.pdf
            # "a new request is put in an existing session if two conditions are valid:
            # * the IP address and the user-agent are the same of the requests already
            # inserted in the session,
            # * the request is done less than fifteen minutes after the last request inserted."
            theDate, theTime = fields[0], fields[1]
            newRequestTime = datetime.datetime.strptime(theDate + " " + theTime, "%Y-%m-%d %H:%M:%S")
            ipAddr, userAgent = fields[8], fields[9]
            if ipAddr not in ipDict:
                ipDict[ipAddr] = {userAgent: [1, newRequestTime]}
            else:
                if userAgent not in ipDict[ipAddr]:
                    ipDict[ipAddr][userAgent] = [1, newRequestTime]
                else:
                    ipdipaua = ipDict[ipAddr][userAgent]
                    if newRequestTime - ipdipaua[1] >= sessionTimeout:
                        ipdipaua[0] += 1
                    ipdipaua[1] = newRequestTime
    infile.close()
    return recordsRead
def outputSessions():
    outfile = open(outfileName, "w")
    outfile.write("#Fields: IPAddr Sessions\n")
    recordsWritten = len(ipDict)
    # ipDict[ip] is { userAgent1: [numSessions, lastTimeStamp], ... }
    for ip, val in ipDict.iteritems():
        # TODO: sum over on all keys' values [(v, k) for (k, v) in d.iteritems()].
        totalSessions = reduce(operator.add, [v2[0] for v2 in val.itervalues()])
        outfile.write("%s\t%d\n" % (ip, totalSessions))
    outfile.close()
    return recordsWritten
recordsRead = inputRecords()
recordsWritten = outputSessions()
print "Finished session reconstruction: read %d records, wrote %d\n" % (recordsRead, recordsWritten)
Update: This took 39 seconds to input and process 342K records and write 21K records. That's good enough speed for my purposes. Apparently 3/4 of that time was spent in strptime()!
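If the strptime() cost ever matters, one common workaround is to slice the fixed-width timestamp by hand and call the datetime constructor directly. A sketch, assuming every record uses exactly the YYYY-MM-DD and HH:MM:SS layouts shown in the sample line; parse_timestamp is an illustrative name, and I haven't timed it against this log:

```python
import datetime


def parse_timestamp(the_date, the_time):
    """Parse 'YYYY-MM-DD' and 'HH:MM:SS' strings without strptime()."""
    return datetime.datetime(
        int(the_date[0:4]), int(the_date[5:7]), int(the_date[8:10]),
        int(the_time[0:2]), int(the_time[3:5]), int(the_time[6:8]))


stamp = parse_timestamp("2010-09-21", "23:59:59")
# Sanity check against the strptime() call used in the script.
assert stamp == datetime.datetime.strptime("2010-09-21 23:59:59", "%Y-%m-%d %H:%M:%S")
```

In inputRecords() this would stand in for the strptime() line, i.e. newRequestTime = parse_timestamp(theDate, theTime). A malformed timestamp still raises ValueError, as it does with strptime().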