Tags: python, mongodb, heroku, flask, python-rq

Large memory Python background jobs


I am running a Flask server which loads data into a MongoDB database. Since there is a large amount of data, and this takes a long time, I want to do this via a background job.

I am using Redis as the message broker and Python-rq to implement the job queues. All the code runs on Heroku.

As I understand it, python-rq uses pickle to serialise the function to be executed, including its parameters, and stores this along with other values in a Redis hash.
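
Concretely, what I'm doing is roughly this (the task and variable names are simplified stand-ins, not my real code):

    from redis import Redis
    from rq import Queue
    from pymongo import MongoClient

    def load_data(records):
        # hypothetical background task: bulk-insert the documents into MongoDB
        MongoClient().mydb.mycollection.insert_many(records)

    # stand-in for the real ~50MB list of documents
    records = [{"key": i, "value": "..."} for i in range(500_000)]

    q = Queue(connection=Redis())

    # rq pickles the function reference and `records` into the job hash it stores
    # in Redis; this serialised payload is what takes the time and the memory.
    job = q.enqueue(load_data, records)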

Since the parameters contain the information to be saved to the database, they are quite large (~50MB). When this payload is serialised and saved to Redis, it not only takes a noticeable amount of time but also consumes a large amount of memory. Redis plans on Heroku cost $30 per month for only 100MB. In fact, I very often get OOM errors like:

OOM command not allowed when used memory > 'maxmemory'.

I have two questions:

  1. Is python-rq well suited to this task or would Celery's JSON serialisation be more appropriate?
  2. Is there a way to serialise not the parameter itself, but rather a reference to it?

Your thoughts on the best solution are much appreciated!


Solution

  • Since you mentioned in your comment that your task input is a large list of key/value pairs, I'm going to recommend the following (a rough sketch of both the producer and the worker side follows at the end of this answer):

    • Load up your list of key/value pairs in a file.
    • Upload the file to Amazon S3.
    • Get the resulting file URL, and pass that into your RQ task.
    • In your worker task, download the file.
    • Parse the file line-by-line, inserting the documents into Mongo.

    Using the method above, you'll be able to:

    • Quickly break up your tasks into manageable chunks.
    • Upload these small, compressed files to S3 quickly (use gzip).
    • Greatly reduce your Redis usage, since much less data needs to be passed over the wire.
    • Configure S3 to automatically delete your files after a certain amount of time (there are S3 settings for this: you can have it delete automatically after 1 day, for instance).
    • Greatly reduce memory consumption on your worker by processing the file one line at a time.
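
    For the auto-deletion point above, one option is a lifecycle rule set once on the bucket. A minimal sketch with boto3, assuming a bucket name and an imports/ prefix that are not part of your setup:

        import boto3

        # Expire objects under the imports/ prefix one day after upload.
        # Bucket name and prefix are assumptions; use your own.
        boto3.client("s3").put_bucket_lifecycle_configuration(
            Bucket="my-import-bucket",
            LifecycleConfiguration={
                "Rules": [
                    {
                        "ID": "expire-import-batches",
                        "Filter": {"Prefix": "imports/"},
                        "Status": "Enabled",
                        "Expiration": {"Days": 1},
                    }
                ]
            },
        )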

    For a use case like yours, this will be MUCH faster and require much less overhead than sending these items through your queueing system.
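
    Putting the steps above together, the two sides might look roughly like this. This is a sketch, not a drop-in implementation: the bucket name, the task names, and the one-JSON-document-per-line file layout are all assumptions you'd adapt to your data.

        import gzip
        import json

        import boto3
        from pymongo import MongoClient
        from redis import Redis
        from rq import Queue

        BUCKET = "my-import-bucket"  # assumed bucket name

        def enqueue_import(records, key="imports/batch-0001.json.gz"):
            """Producer side: write the key/value pairs to a gzipped file,
            upload it to S3, and enqueue only the S3 location."""
            with gzip.open("/tmp/batch.json.gz", "wt") as f:
                for doc in records:
                    f.write(json.dumps(doc) + "\n")  # one document per line
            boto3.client("s3").upload_file("/tmp/batch.json.gz", BUCKET, key)
            Queue(connection=Redis()).enqueue(import_from_s3, BUCKET, key)

        def import_from_s3(bucket, key, batch_size=1000):
            """Worker side: download the file, parse it line by line, and
            insert into Mongo in small batches to keep memory flat."""
            boto3.client("s3").download_file(bucket, key, "/tmp/batch.json.gz")
            collection = MongoClient().mydb.mycollection
            batch = []
            with gzip.open("/tmp/batch.json.gz", "rt") as f:
                for line in f:
                    batch.append(json.loads(line))
                    if len(batch) >= batch_size:
                        collection.insert_many(batch)
                        batch = []
            if batch:
                collection.insert_many(batch)

    Only the bucket/key strings travel through Redis, so the job payload drops from ~50MB to a few hundred bytes.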

    Hope this helps!