python rabbitmq gevent

How to validate URLs using gevent


I have one million URLs to validate. Some of them are unreachable from my country and some are invalid, and I want to check all of them. I'm using Python for this, with gevent to speed it up, but I'm new to gevent and some things don't seem to work. My code is as follows:

import gevent
import gevent.monkey
import urllib2
from gevent.pool import Pool
from gevent import Timeout
gevent.monkey.patch_all()
p = Pool(10)

seconds = 10

#timeout = Timeout(seconds)
#timeout.start()
#timer = Timeout(3).start()

def down(url):
    urllib2.urlopen(url)


def wait():
    while True:
        gevent.sleep(0)
        print 'hi'
        with Timeout(5,False):
            p.spawn(down,'http://www.twitter.com')
        print '---------------------------------'
wait()

Twitter is unreachable from my country. The output is:

hi
---------------------------------
hi
---------------------------------
hi

It never reports a timeout after 5 seconds. What's wrong with my code?

I'd also like to know how to add new tasks to gevent while it is running.

I want to validate my URLs in a distributed way, so I read each URL from my database and send it to a message queue. Many receivers take messages from the queue and validate the URLs.

My message queue is RabbitMQ.

I only know that if I have 10 URLs, I can use gevent like this:

tasks = []
for x in xrange(10):
    tasks.append(gevent.spawn(validate, url))
gevent.joinall(tasks)

But in my situation I read one message and then spawn a greenlet, and if a URL is unreachable, it blocks message handling until that greenlet has finished.

So how can I validate my URLs asynchronously, i.e. keep reading URLs and spawning greenlets without blocking?

Thanks.


Solution

  • You need to put the with Timeout() block around the code that actually does the I/O or waiting. Right now, you're wrapping the gevent.spawn()/pool.spawn() call, which isn't right. In this case, the I/O call you want to time out is urllib2.urlopen(url).

    Code of this nature would typically look something like this:

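    import gevent.monkey
    gevent.monkey.patch_all()  # patch sockets first so urllib2 cooperates with gevent

    import urllib2
    from gevent import Timeout
    from gevent.pool import Pool

    validated = []
    urls = ["http://a.com", "http://b.com"]

    def down(url):
        # Time out the blocking I/O itself; False makes the timeout silent.
        with Timeout(5, False):
            urllib2.urlopen(url)
            validated.append(url)

    pool = Pool(10)  # Pool lives in gevent.pool, not on the gevent module itself
    for url in urls:
        pool.spawn(down, url)
    pool.join()  # the original code has no pool.join() because it already has a wait loop, which is okay
    print "Valid URLs are: %s" % ", ".join(validated)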
    

    You can keep your infinite while True loop, and in there grab incoming URLs from your database/queue. That's probably what you want. I'm just giving an example of what I would do to check a pre-set list of URLs I want to verify.
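
As a rough sketch of such a loop (written for Python 3, and assuming a gevent.queue.Queue as a stand-in for the RabbitMQ consumer, which the question does not show), the reader below keeps pulling URLs and handing them to a pool. The names validate and consume, the example.com URLs and the None shutdown marker are all made up for illustration, and gevent.sleep stands in for the real urlopen call.

```python
import gevent
from gevent.pool import Pool
from gevent.queue import Queue


def validate(url, results):
    # Stand-in for the real check: yield briefly as a network call would.
    gevent.sleep(0.01)
    results.append(url)


def consume(incoming, pool, results):
    # Keep reading URLs and hand each one to the pool without waiting
    # for earlier checks to finish.
    while True:
        url = incoming.get()  # yields to other greenlets while the queue is empty
        if url is None:       # None is this sketch's own shutdown marker
            break
        pool.spawn(validate, url, results)  # blocks only while the pool is full
    pool.join()  # wait for the checks that are still in flight


incoming = Queue()
for i in range(5):
    incoming.put("http://example.com/%d" % i)
incoming.put(None)

results = []
consume(incoming, Pool(2), results)
print(sorted(results))
```

pool.spawn() waits only while every slot in the pool is busy, so one slow URL does not hold up the messages behind it. With a real broker you would also have to decide when each message gets acknowledged, which this sketch leaves out.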

    In your code, the error is that you wrapped pool.spawn() in the with Timeout() block. Spawning a greenlet returns almost instantly, so a timeout around it does nothing, and that is why you never see a timeout. You need to wrap the urllib2.urlopen() call in the Timeout() context instead.
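
The difference can be seen in a small sketch (Python 3, with gevent.sleep standing in for a urlopen call that hangs). The names slow_fetch and check and the slow.example URL are hypothetical and not part of the question's code.

```python
import gevent
from gevent import Timeout
from gevent.pool import Pool


def slow_fetch(url):
    # Stand-in for a urlopen() call that hangs for a while.
    gevent.sleep(0.5)
    return url


def check(url, reachable, timed_out):
    # The timeout wraps the blocking call itself, inside the greenlet.
    with Timeout(0.05, False):
        slow_fetch(url)
        reachable.append(url)
        return
    # Reached only when the timeout fired and was silently swallowed.
    timed_out.append(url)


pool = Pool(10)
reachable, timed_out = [], []

# Wrapping spawn() has no effect: spawn() returns at once, so this timeout
# is cancelled on leaving the block, long before slow_fetch finishes.
with Timeout(0.05, False):
    unprotected = pool.spawn(slow_fetch, "http://slow.example")

pool.spawn(check, "http://slow.example", reachable, timed_out)
pool.join()

print("unprotected greenlet returned:", unprotected.value)
print("timed out:", timed_out)
```

The greenlet spawned inside the outer with block runs to completion despite the timeout around spawn(), while check gives up after 0.05 seconds because its timeout surrounds the waiting call.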

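    Also, if you're just checking for timeouts, this works fine. You may want to check whether the request returned an HTTP 200 code though, in which case you should check urllib2.urlopen(url).getcode(). Keep in mind that urlopen raises urllib2.URLError (or its subclass HTTPError for 4xx/5xx responses) when a request fails, and Timeout(5, False) only silences the timeout itself, so those errors need their own handling.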
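
A status check along those lines might look like the sketch below. It is written for Python 3, where urllib2 was split into urllib.request and urllib.error. The function name is_valid, the choice to treat only HTTP 200 as valid and the 5-second default are assumptions made for the example, not something the question specifies.

```python
import http.client
import urllib.error
import urllib.request


def is_valid(url, timeout=5):
    """Return True only if the URL answers with HTTP 200."""
    try:
        response = urllib.request.urlopen(url, timeout=timeout)
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # URLError covers DNS failures and refused connections, and its
        # HTTPError subclass covers 4xx/5xx responses; OSError covers socket
        # timeouts; ValueError is raised for strings that are not URLs at all.
        return False
    try:
        return response.getcode() == 200
    finally:
        response.close()


print(is_valid("not a url"))
```

Because the function only relies on the standard socket layer, it should also be usable as the target of pool.spawn() once gevent.monkey.patch_all() has been applied.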