Tags: python, python-2.7, cherrypy, python-multithreading

How to return data from a CherryPy BackgroundTask running as fast as possible


I'm building a web service for iterative batch processing of data using CherryPy. The ideal workflow is as follows:

  1. Users POST data to the service for processing
  2. When the processing job is free, it collects the queued data and starts another iteration
  3. While the job is processing, users are POSTing more data to the queue for the next iteration
  4. Once the current iteration is finished, the results are passed back so that users can GET them using the same API.
  5. The job starts again with the next batch of queued data.

The key consideration here is that the processing should run as fast as possible, with each iteration starting as soon as the previous one finishes, regardless of the amount of data in the queue. There's no upper bound on how long each iteration can take, so I can't create a fixed schedule for it to run on.

There are a few examples of using BackgroundTask (like this one) but I've yet to find one that deals with returning data, or one that deals with tasks running as fast as possible as opposed to on a fixed schedule.

I'm not wedded to the BackgroundTask solution so if anyone can offer an alternative one I'd be more than happy. It feels like there's a solution within the framework though.


Solution

  • Don't run the processing with BackgroundTask: it runs in a thread inside the CherryPy process, and because of the GIL a CPU-bound job competes with the request-handling threads, so CherryPy may struggle to answer new requests. Use a queue that runs your background tasks in a separate process instead, such as Celery or RQ.

    The rest of this answer walks through an example using RQ. RQ uses Redis as its message broker, so first of all you need to install and start Redis.

    Then create a module (mytask in my example) containing the long-running background functions:

    import time
    
    
    def long_running_task(value):
        # Stand-in for real work: wait 15 seconds, then return the input's length.
        time.sleep(15)
        return len(value)
    

    Start one RQ worker, or several if you want tasks to run in parallel. The Python interpreter running the workers must be able to import your mytask module, so export PYTHONPATH before starting the worker if the module is not already on the path:

    # rq worker
    

    Below is a very simple CherryPy web app that shows how to use the RQ queue:

    import cherrypy
    from redis import Redis
    from rq import Queue
    from mytask import long_running_task
    
    
    class BackgroundTasksWeb(object):
    
        def __init__(self):
            # Default RQ queue on a Redis server reached with default settings.
            self.queue = Queue(connection=Redis())
            # Jobs enqueued by this process; kept in memory only.
            self.jobs = []
    
        @cherrypy.expose
        def index(self):
            # Submission form plus an iframe showing the auto-refreshing job list.
            html = ['<html>', '<body>']
            html += ['<form action="job">', '<input name="q" type="text" />', '<input type="submit" />', "</form>"]
            html += ['<iframe width="100%" src="/results"></iframe>']
            html += ['</body>', '</html>']
            return '\n'.join(html)
    
        @cherrypy.expose
        def results(self):
            # The meta refresh reloads this page every 2 seconds.
            # Note: the submitted input is echoed without HTML escaping.
            html = ['<html>', '<head>', '<meta http-equiv="refresh" content="2" >', '</head>', '<body>']
            html += ['<ul>']
            html += ['<li>job:{} status:{} result:{} input:{}</li>'.format(j.get_id(), j.get_status(), j.result, j.args[0]) for j in self.jobs]
            html += ['</ul>']
            html += ['</body>', '</html>']
            return '\n'.join(html)
    
        @cherrypy.expose
        def job(self, q):
            # Hand the task to an RQ worker, then send the browser back to the form.
            job = self.queue.enqueue(long_running_task, q)
            self.jobs.append(job)
            raise cherrypy.HTTPRedirect("/")
    
    
    cherrypy.quickstart(BackgroundTasksWeb())
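
If clients should GET the results as data rather than as an HTML page, the job attributes already used in results() (get_id(), get_status(), result and args) could be serialized to JSON instead. This is only a sketch: summarize_jobs is a hypothetical helper, not part of RQ or CherryPy, and it assumes that statuses are plain strings and that inputs and results are JSON-serializable.

```python
import json


def summarize_jobs(jobs):
    """Return a JSON string with the id, status, result and input of each job.

    `jobs` is any iterable of objects exposing get_id(), get_status(),
    .result and .args, as the jobs in the example above do.
    """
    return json.dumps([
        {
            "id": job.get_id(),
            "status": job.get_status(),
            "result": job.result,    # None until the job has finished
            "input": job.args[0],
        }
        for job in jobs
    ])
```

An exposed handler could return summarize_jobs(self.jobs) after setting cherrypy.response.headers['Content-Type'] to 'application/json'.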
    

    In a production web app I would use the Jinja2 template engine to generate the HTML, and most likely WebSockets to update the job status in the web browser.
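
The RQ example enqueues one job per request. The batch-per-iteration behaviour described in the question (take everything queued so far, process it, repeat immediately) could be sketched roughly as follows. The standard-library queue.Queue is used purely as a stand-in for whatever queue backs the service, and drain, run_iterations and process_batch are hypothetical names. Where the loop runs (ideally a separate process, as advised above) is left open. The code is written for Python 3; on Python 2.7 the module is named Queue.

```python
import queue
import threading


def drain(pending):
    """Remove and return everything currently in `pending` without blocking."""
    batch = []
    while True:
        try:
            batch.append(pending.get_nowait())
        except queue.Empty:
            return batch


def run_iterations(pending, process_batch, results, stop):
    """Process batches back to back until the `stop` event is set.

    Each pass waits briefly for a first item, so an empty queue does not
    spin the CPU, then takes whatever else arrived in the meantime.
    """
    while not stop.is_set():
        try:
            first = pending.get(timeout=0.1)
        except queue.Empty:
            continue  # nothing queued yet; check `stop` and wait again
        batch = [first] + drain(pending)
        # The next iteration starts as soon as this call returns.
        results.append(process_batch(batch))
```

There is no fixed schedule here: the loop only waits when the queue is empty, so an iteration of any length is followed directly by the next one.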