I am using gevent to process some URLs; for a few of them I use lxml's etree to retrieve and parse the responses. When I retrieve those URLs with etree.parse(url), it seems to block, even though I have monkey-patched everything. If I retrieve them via requests instead, no blocking occurs.
import time
import gevent
from lxml import etree
from gevent import monkey
monkey.patch_all()
import requests

def first():
    url = 'http://www.google.com'
    r = requests.get(url)
    return r

def second():
    url = 'http://url_to_large_xml_that_requires_api_key'
    r = etree.parse(url)  # this blocks "first()"
    #r = requests.get(url)
    return r

def get_external(i):
    if i == 'first':
        return first()
    elif i == 'second':
        return second()

threads = [gevent.spawn(get_external, i) for i in ['first', 'second']]
gevent.joinall(threads)
If I uncomment r = requests.get(url) and comment out r = etree.parse(url), the entire script runs much faster and doesn't block first(). I know a solution could be to fetch with requests and then process the result with etree, but I'd like to understand why etree is blocking in the first place.
gevent.monkey, as documented, makes the Python standard library cooperative with gevent's model. lxml is not part of the standard library; thus, it should be no surprise that it isn't supported by gevent.monkey.
Indeed, if you look at the module documentation, you'll find a list of the target modules it knows how to patch: socket is a member of the set; lxml is certainly not.
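One way to see this for yourself is to ask gevent which modules it has replaced. The sketch below assumes a gevent version that provides monkey.is_module_patched; the variable names are only for illustration.

```python
from gevent import monkey
monkey.patch_all()

# Standard-library modules that gevent.monkey knows how to patch are
# reported as patched once patch_all() has run.
socket_patched = monkey.is_module_patched('socket')

# lxml is a third-party C extension; gevent.monkey has no cooperative
# replacement for it, so it is never reported as patched.
lxml_patched = monkey.is_module_patched('lxml')

print('socket patched:', socket_patched)
print('lxml patched:', lxml_patched)
```

Anything that reaches the network without going through one of the patched modules keeps its ordinary blocking behaviour.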
So, to the larger question -- the answer to "how do I monkey patch lxml for gevent support?" starts with "first, write an implementation of the underlying calls which support the gevent model".
However, as lxml is rooted in C rather than staying in Python, that's not necessarily even possible unless the interface is clearly abstracted in an accessible way. That lxml's etree.parse(url) call wasn't affected by monkey-patching the socket module is a clear indicator that (unlike, for instance, the requests module) it's using libxml2's native functionality rather than the Python socket module. Your best bet is to proceed as planned, performing the retrieval out-of-band from the parse operation.
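A minimal sketch of that approach follows, assuming the API key can be supplied through whatever request the server expects. The helper names fetch_bytes and fetch_and_parse, and the 30-second timeout, are hypothetical and not part of any library; the URL is the placeholder from the question.

```python
# Patch before importing requests, so its sockets are cooperative.
from gevent import monkey
monkey.patch_all()

from io import BytesIO

import gevent
import requests
from lxml import etree


def fetch_bytes(url, timeout=30):
    """Download the body with requests; the greenlet yields while waiting on the network."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content


def fetch_and_parse(url, fetch=fetch_bytes):
    """Retrieve the document cooperatively, then hand the bytes to lxml."""
    data = fetch(url)
    # lxml only sees in-memory bytes here, so libxml2 never opens the URL itself.
    return etree.parse(BytesIO(data))


if __name__ == '__main__':
    # Placeholder URL taken from the question; replace with a real endpoint.
    urls = ['http://url_to_large_xml_that_requires_api_key']
    jobs = [gevent.spawn(fetch_and_parse, u) for u in urls]
    gevent.joinall(jobs)
    for job in jobs:
        # job.value is the parsed tree, or None if the greenlet raised.
        print(job.value)
```

The parse itself still runs in C and does not yield to other greenlets, but it no longer includes the network wait, which is usually the slow part.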