
Why doesn't Tornado's AsyncHTTPClient send the request immediately?


In my current application I use Tornado's AsyncHTTPClient to make requests to a web site. The flow is complex: processing the response from one request results in another request.

Concretely, I download an article, then analyze it and download the images mentioned in it.

What bothers me is that while my log clearly shows the message indicating that .fetch() on the photo URL has been issued, no actual HTTP request is made, as sniffed in Wireshark.

I tried tinkering with max_clients and the curl/simple HTTP client implementations, but the behavior is always the same: until all articles are downloaded, no photo requests are actually issued. How can I change this?

Update: some pseudocode.

@VictorSergienko I am on Linux, so by default, I guess, the EPoll version is used. The whole system is too complicated, but it boils down to:

@gen.coroutine
def fetch_and_process(self, url, callback):
  # fetch() resolves to an HTTPResponse; pass its body on to the callback
  response = yield self.async_client.fetch(url)
  res = yield callback(response.body)
  return res

@gen.coroutine
def process_articles(self, urls):
  # Enqueue but don't wait for one: calling the coroutine starts it
  # immediately, and we collect the futures instead of yielding each
  futures = [self.fetch_and_process(url, self.process_article)
             for url in urls]
  # wait for all tasks to finish
  yield futures

@gen.coroutine
def process_article(self, body):
  photo_url = self.extract_photo_url_from_page(body)
  do_some_stuff()
  print('I gonna download that photo ' + photo_url)
  yield self.download_photo(photo_url)

@gen.coroutine
def download_photo(self, photo_url):
  response = yield self.async_client.fetch(photo_url)
  # open in binary write mode and save the response body
  with open(self.construct_filename(photo_url), 'wb') as f:
    f.write(response.body)

And when it prints "I gonna download that photo", no actual request is made! Instead, it keeps downloading more articles and enqueueing more photo requests until all articles are downloaded; only THEN are all the photos requested in bulk.
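The ordering described above can be reproduced with a plain FIFO queue standing in for AsyncHTTPClient's internal request queue. This is a pure-Python simulation (no Tornado involved, and the URL names are made up for illustration):

```python
from collections import deque

# Stand-in for AsyncHTTPClient's internal request queue:
# requests are served strictly first-in, first-out.
queue = deque()
completed = []

def fetch(url):
    queue.append(url)

# The article loop enqueues every article before any response is processed.
for i in range(3):
    fetch('article-%d' % i)

# Drain the queue. Finishing an article enqueues its photo, which lands
# *behind* all the articles that are already waiting.
while queue:
    url = queue.popleft()
    completed.append(url)
    if url.startswith('article'):
        fetch(url.replace('article', 'photo'))

print(completed)
# ['article-0', 'article-1', 'article-2', 'photo-0', 'photo-1', 'photo-2']
```

Every article completes before the first photo is even attempted, matching what Wireshark shows.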


Solution

  • AsyncHTTPClient has a queue, which you are filling up immediately in process_articles ("Enqueue but don't wait for one"). By the time the first article is processed, its photos will go to the end of the queue, after all the other articles.

    If you yielded self.fetch_and_process for each article in turn (instead of starting them all at once) in process_articles, you would alternate between articles and their photos, but you could only be downloading one thing at a time. To maintain a balance between articles and photos while still downloading more than one thing at a time, consider using the toro package for synchronization primitives. The example at http://toro.readthedocs.org/en/stable/examples/web_spider_example.html is similar to your use case.
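toro's primitives were later merged into Tornado itself (as tornado.locks and tornado.queues, in Tornado 4.2). The same bounded-concurrency idea can be sketched with stdlib asyncio; this is a simulation with a dummy fetch and made-up URL names, not the toro spider example itself:

```python
import asyncio

async def crawl(urls, max_concurrent=2):
    # A semaphore caps how many downloads are in flight at once, so a
    # photo discovered in an article competes for the next free slot
    # instead of waiting behind every remaining article.
    sem = asyncio.Semaphore(max_concurrent)
    fetched = []
    photo_tasks = []

    async def fetch(url):
        async with sem:
            await asyncio.sleep(0.01)  # stand-in for the HTTP request
            fetched.append(url)
        if url.startswith('article'):
            # Start the photo download as soon as the article is parsed.
            photo = url.replace('article', 'photo')
            photo_tasks.append(asyncio.create_task(fetch(photo)))

    await asyncio.gather(*(fetch(u) for u in urls))
    # By now every article has finished, so photo_tasks is complete.
    await asyncio.gather(*photo_tasks)
    return fetched

fetched = asyncio.run(crawl(['article-%d' % i for i in range(3)]))
```

With max_concurrent=2, a photo can begin downloading while later articles are still in flight, which is exactly the balance the question asks for.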