python, tornado, nonblocking

Tornado web scraping is slower than the threads implementation


I implemented a simple web scraper using Tornado. The main idea is to insert all URLs into a queue q and spawn multiple workers to ping each URL and check its status (most of the URLs don't exist, i.e. the requests time out).

All responses are inserted into another queue q2, but that's irrelevant here because that queue is only processed after all the workers are done.

I also implemented the same approach using threads, with the same level of concurrency. The thread implementation is much faster, even though the threads sit idle while waiting for a response from the web, whereas Tornado's IOLoop should be optimal for exactly this kind of workload.

What am I missing? Thanks in advance.

from tornado import httpclient, gen, ioloop, queues

concurrency = 100

@gen.coroutine
def get_response(url):
    response = yield httpclient.AsyncHTTPClient().fetch(url, raise_error=False)
    return response


@gen.coroutine
def main():
    q = queues.Queue()
    q2 = queues.Queue()

    @gen.coroutine
    def fetch_url():
        url = yield q.get()
        try:
            response = yield get_response(url)
            q2.put((url, response.code))
        finally:
            q.task_done()

    @gen.coroutine
    def worker():
        while True:
            yield fetch_url()

    # urls is assumed to be defined elsewhere as a list of URL strings
    for url in urls:
        q.put(url)

    print("all tasks were sent...")

    # Start workers, then wait for the work queue to be empty.
    for _ in range(concurrency):
        worker()

    print("workers spawned")

    yield q.join()
    
    print("done")


if __name__ == '__main__':
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(main)

The thread implementation is simple (no multiprocessing) and uses the following code:

for i in range(concurrency):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()
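
The worker function itself isn't shown in the question, but a minimal threaded worker matching the description might look like this (a sketch; the blocking fetch via the standard library's urllib is an assumption, since the question doesn't say which HTTP library the threaded version uses):

```python
import queue
import threading
import urllib.request

q = queue.Queue()
q2 = queue.Queue()

def worker():
    while True:
        url = q.get()
        try:
            # blocking fetch: the thread sleeps in the OS while waiting for
            # the response, releasing the GIL so other threads can run
            with urllib.request.urlopen(url, timeout=10) as resp:
                q2.put((url, resp.status))
        except Exception:
            # timeouts, unreachable hosts, malformed URLs, etc.
            q2.put((url, None))
        finally:
            q.task_done()
```

This mirrors the Tornado version: pull a URL, record (url, status) on q2, and call task_done() so q.join() in the main thread can detect completion.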

Solution

  • There are several reasons why this might be slower:

    1. The goal of asynchronous programming is not speed, it is scalability. The asynchronous implementation should perform better at high levels of concurrency (in particular, it will use much less memory), but at low levels of concurrency there may not be a difference or threads may be faster.

    2. Tornado's default HTTP client is written in pure Python and is missing some features that are important for performance. In particular, it is unable to reuse connections. If the performance of HTTP client requests is important to you, use the libcurl-based client instead:

      tornado.httpclient.AsyncHTTPClient.configure('tornado.curl_httpclient.CurlAsyncHTTPClient')
      
    3. Sometimes DNS resolution is blocking even in an otherwise-asynchronous HTTP client, which can limit effective concurrency. This was true of Tornado's default HTTP client until Tornado 5.0. For the curl-based client, it depends on how libcurl was built: you need a version of libcurl that was built with the c-ares library. Last time I looked, this was not done by default on most Linux distributions.
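
If you're unsure whether your libcurl build includes c-ares, one way to check (assuming pycurl is installed) is to inspect libcurl's version string, which lists optional components such as "c-ares/x.y.z" when they were compiled in:

```python
import pycurl

# libcurl reports its build configuration in the version string, e.g.
# "PycURL/... libcurl/7.88.1 OpenSSL/3.0.8 zlib/1.2.13 c-ares/1.19.0 ..."
print(pycurl.version)
print("built with c-ares:", "c-ares" in pycurl.version)
```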