Tornado: How to get and return large data with less memory usage?


I have a web crawler and an HTTP interface for it.

The crawler receives grouped URLs as a dictionary, and I need to return the result in the same format as JSON. But I ran into heavy memory usage that is not released back to the operating system. How can I implement this without the large memory footprint?

Code:

#!/usr/bin/env python
# coding=utf-8

import collections

import tornado.web
import tornado.ioloop
import tornado.queues
import tornado.httpclient


class ResponseError(Exception):

    pass


class Crawler(object):

    client = tornado.httpclient.AsyncHTTPClient()

    def __init__(self, groups, concurrency=10, retries=3, validators=None):
        self.groups = groups
        self.concurrency = concurrency
        self.retries = retries
        self.validators = validators or []

        self.requests = tornado.queues.Queue()
        self.responses = collections.defaultdict(list)

    async def worker(self):
        while True:
            await self.consume()

    async def validate(self, response):
        for validator in self.validators:
            validator(response)

    async def save(self, response):
        self.responses[response.request.group].append(response.body.decode('utf-8'))

    async def consume(self):
        async for request in self.requests:
            try:
                response = await self.client.fetch(request, raise_error=False)

                await self.validate(response)
                await self.save(response)
            except ResponseError:
                if request.retries < self.retries:
                    request.retries += 1
                    await self.requests.put(request)
            finally:
                self.requests.task_done()


    async def produce(self):
        for group, urls in self.groups.items():
            for url in urls:
                request = tornado.httpclient.HTTPRequest(url)
                request.group = group
                request.retries = 0
                await self.requests.put(request)

    async def fetch(self):
        await self.produce()

        for __ in range(self.concurrency):
            tornado.ioloop.IOLoop.current().spawn_callback(self.worker)

        await self.requests.join()



class MainHandler(tornado.web.RequestHandler):

    async def get(self):
        urls = []

        with open('urls') as f:  # mock
            for line in f:
                urls.append(line.strip())

        crawler = Crawler({'default': urls})

        await crawler.fetch()

        self.write(crawler.responses)


if __name__ == '__main__':
    app = tornado.web.Application(
        (tornado.web.url(r'/', MainHandler),), debug=True
    )
    app.listen(8000)

    tornado.ioloop.IOLoop.current().start()

Solution

  • It looks to me like most of the memory usage is devoted to self.responses. Since you seem to be ordering responses by "group" before writing them to a file, I can understand why you do it this way. One idea is to store them in a database (MySQL or MongoDB or whatever) with "group" as a column or field value in the database record.

    The database might be the final destination of your data, or it might be a temporary place to store the data until crawler.fetch completes. Then query all the data from the database, ordered by "group", and write it to the file, as sketched below.

    This doesn't eliminate the memory usage; it just means that the database process, rather than the Python process, is responsible for most of it. That may be preferable for you, however.
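
    For illustration, a minimal sketch of that idea follows. It uses the standard-library sqlite3 module only to keep the example self-contained (MySQL or MongoDB, as suggested above, would work the same way conceptually); the DBStorage class and its save/dump methods are made-up names, and the synchronous sqlite3 calls would briefly block Tornado's IOLoop, so a real deployment would use an async driver or run the inserts in an executor.

    import sqlite3


    class DBStorage(object):
        """Hypothetical helper that keeps crawler responses on disk
        instead of accumulating them in Crawler.responses."""

        def __init__(self, path='responses.db'):
            self.conn = sqlite3.connect(path)
            # "group" is a reserved word in SQL, so it must be quoted.
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS responses ("group" TEXT, body TEXT)'
            )

        def save(self, group, body):
            # Each insert goes straight to disk, so nothing piles up in the
            # Python process.
            with self.conn:
                self.conn.execute(
                    'INSERT INTO responses ("group", body) VALUES (?, ?)',
                    (group, body),
                )

        def dump(self, path):
            # After crawler.fetch() finishes, read the rows back ordered by
            # "group" and write them out one at a time, so the full result
            # set is never held in memory at once.
            with open(path, 'w') as f:
                for group, body in self.conn.execute(
                        'SELECT "group", body FROM responses ORDER BY "group"'):
                    f.write('%s\t%s\n' % (group, body))


    # In Crawler, self.responses would be replaced by a DBStorage instance,
    # and save() would become:
    #
    #     async def save(self, response):
    #         self.storage.save(response.request.group,
    #                           response.body.decode('utf-8'))

    Whether the database is the final destination of the data or just a temporary spill area, the key point is the same: each response leaves the Python process as soon as it arrives, instead of accumulating in a dictionary until the crawl finishes.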