I'm building a script to process web server logs, and I'm trying to incorporate MaxMind's IP dataset (http://dev.maxmind.com/geoip/legacy/geolite/) into the script to get the country each hit comes from.
Currently, my script works fine when it only extracts the information I want. However, when I add IP lookups it slows down a lot, by about 1800%. So I'm curious whether this has something to do with my code, or whether there is a way to speed it up.
For example, the following code, which extracts the date and IP address, took about 6.5 seconds to run in this experiment.
extractedData = []
for log in logList:
    ip = log[-1]
    date = log[0]
    dateIP = [date, ip]
    extractedData.append(dateIP)
When I add pygeoip and look up the country for each IP, it slows down. The following code took 2 minutes and 7 seconds to run.
import pygeoip

extractedData = []
gi = pygeoip.GeoIP('/path/to/GeoIP.dat')
for log in logList:
    ip = log[-1]
    country = gi.country_name_by_addr(ip)
    date = log[0]
    dateCountry = [date, country]
    extractedData.append(dateCountry)
So, is there a way to speed this up? As it stands, the lookup slows the processing down too much.
Thanks!
Since you're doing many queries, you should load the database into memory. As it stands, you're repeatedly reading from the disk, which is painfully slow.
Replace this line:
gi = pygeoip.GeoIP('/path/to/GeoIP.dat')
with this:
gi = pygeoip.GeoIP('/path/to/GeoIP.dat', pygeoip.MEMORY_CACHE)
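If the lookups are still a bottleneck after that, one further idea (my own suggestion, not something pygeoip does for you) is to remember the country for each IP you have already seen, since web logs usually contain the same addresses many times. Here is a minimal sketch of that. The function name extract_date_country and the country_by_ip dictionary are hypothetical names I made up for illustration, and gi is assumed to be any object with a country_name_by_addr(ip) method, such as the GeoIP object created above.

```python
def extract_date_country(log_list, gi):
    """Return a [date, country] pair for every parsed log entry.

    Assumptions: each entry has the date first and the IP last (as in the
    question), and `gi` exposes country_name_by_addr(ip).
    """
    country_by_ip = {}  # countries already looked up, keyed by IP string
    extracted = []
    for log in log_list:
        date, ip = log[0], log[-1]
        if ip not in country_by_ip:
            # Only hit the GeoIP database the first time we see this IP.
            country_by_ip[ip] = gi.country_name_by_addr(ip)
        extracted.append([date, country_by_ip[ip]])
    return extracted
```

You would then call it as extractedData = extract_date_country(logList, gi). How much this helps depends on how often IPs repeat in your logs, so it is worth timing both versions on your own data.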