I have some questions about the performance of this simple python script:
import sys, urllib2, asyncore, socket, urlparse
from timeit import timeit
class HTTPClient(asyncore.dispatcher):
def __init__(self, host, path):
asyncore.dispatcher.__init__(self)
self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
self.connect( (host, 80) )
self.buffer = 'GET %s HTTP/1.0\r\n\r\n' % path
self.data = ''
def handle_connect(self):
pass
def handle_close(self):
self.close()
def handle_read(self):
self.data += self.recv(8192)
def writable(self):
return (len(self.buffer) > 0)
def handle_write(self):
sent = self.send(self.buffer)
self.buffer = self.buffer[sent:]
url = 'http://pacnet.karbownicki.com/api/categories/'
components = urlparse.urlparse(url)
host = components.hostname or ''
path = components.path
def fn1():
try:
response = urllib2.urlopen(url)
try:
return response.read()
finally:
response.close()
except:
pass
def fn2():
client = HTTPClient(host, path)
asyncore.loop()
return client.data
if sys.argv[1:]:
print 'fn1:', len(fn1())
print 'fn2:', len(fn2())
time = timeit('fn1()', 'from __main__ import fn1', number=1)
print 'fn1: %.8f sec/pass' % (time)
time = timeit('fn2()', 'from __main__ import fn2', number=1)
print 'fn2: %.8f sec/pass' % (time)
Here's the output I'm getting on linux:
$ python2 test_dl.py
fn1: 5.36162281 sec/pass
fn2: 0.27681994 sec/pass
$ python2 test_dl.py count
fn1: 11781
fn2: 11965
fn1: 0.30849886 sec/pass
fn2: 0.30597305 sec/pass
Why is urllib2 so much slower than asyncore in the first run?
And why does the discrepancy seem to disappear on the second run?
EDIT: Found a hackish solution to this problem here: Force python mechanize/urllib2 to only use A requests?
The five-second delay disappears if I monkey-patch the socket module as follows:
_getaddrinfo = socket.getaddrinfo
def getaddrinfo(host, port, family=0, socktype=0, proto=0, flags=0):
return _getaddrinfo(host, port, socket.AF_INET, socktype, proto, flags)
socket.getaddrinfo = getaddrinfo
Finally found a good explanation of what causes this problem, and why:
This is a problem with the DNS resolver.
This problem will occur for any DNS request which the DNS resolver does not support. The proper solution is to fix the DNS resolver.
What happens:
- Program is IPv6 enabled.
- When it looks up a hostname, getaddrinfo() asks first for a AAAA record
- the DNS resolver sees the request for the AAAA record, goes "uhmmm I dunno what it is, lets throw it away"
- DNS client (getaddrinfo() in libc) waits for a response..... has to time out as there is no response. (THIS IS THE DELAY)
- No records received yet, thus getaddrinfo() goes for a the A record request. This works.
- Program gets the A records and uses those.
This does NOT only affect IPv6 (AAAA) records, it also affects any other DNS record that the resolver does not support.
For me, the solution was to install dnsmasq (but I suppose any other DNS resolver would do).