Tags: django, multithreading, architecture, subprocess, twisted

Using a library which uses threads and subprocess from a django website


I need to use a library that spawns threads and subprocesses internally.

Since threads/subprocesses are discouraged within the web context, I need some kind of wrapper process around this library.

The question is how to proceed with this. Design solutions or library suggestions, please?

I'm aware of celery, but pretty much all the tasks I need to do with this library must be done ASAP, so I don't want to place them in a queue and get them executed later on.

Some considerations:

  • I would like to have access to the ORM from the wrapper.
  • Also, authentication of django users.
  • Don't really need django and the wrapper distributed in different machines, though it would be a plus.
  • Don't need language interoperability either (everything will be python).

EDIT:

Some suggestions I got in IRC:

  • Use a separate process and do Remote Procedure Calls. Still not sure whether I would use XML, JSON, or something else.
  • Use the Twisted reactor with crochet. Seems almost the same as celery to me.
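The RPC suggestion can be sketched with nothing but the standard library: a wrapper process owns the thread/subprocess-spawning library and exposes it over XML-RPC, and the web process calls it like a local function. The port, function name, and payload below are illustrative, not from the original discussion, and the server runs in a background thread here only to keep the example self-contained (in practice it would be a separate process).

```python
# Minimal XML-RPC wrapper-process sketch (stdlib only).
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

def run_task(payload):
    # Stand-in for calling into the thread/subprocess-spawning library.
    return "processed:" + payload

server = SimpleXMLRPCServer(("127.0.0.1", 8765), logRequests=False, allow_none=True)
server.register_function(run_task)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Django side: an ordinary-looking synchronous call across the process boundary.
proxy = ServerProxy("http://127.0.0.1:8765")
print(proxy.run_task("job-1"))  # -> processed:job-1
```

Since everything is Python anyway, the wrapper process can configure Django's settings itself and use the ORM directly, which covers the first consideration above.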

EDIT 2:

Celery is definitely discarded. I had a long talk about it and I've come to the conclusion that using it for my use case would be ridiculous. What the library I'll be using does is basically manage remote queues, so I would be NESTING queue systems.


Solution

  • My recommendation would be that you use Crochet, but perhaps most useful to you would be an explanation of the distinction between using Crochet and Celery.

    1. Celery is a distributed task queue. This will allow you to un-block your web requests by persisting some state into an external queue (and the first thing you need to do when you use Celery is select a persistence mechanism), then retrieving that state from a consumer when it's time to execute the work.

    2. Twisted (via Crochet) is an event loop with event-driven I/O. You don't need to persist any external state: you simply toss the work from your Django (web-request-handling) thread to the Twisted (event-loop-work-handling) thread.

    With Celery, you need to serialize your job to an external system, as well as configure, monitor and run a message queue service. Since Celery uses Pickle by default to serialize its work, it will work magically right up until the point where you accidentally pull in gigabytes of state, or you upgrade your application and start getting random tracebacks. (Don't use pickle, don't use pickle, don't use pickle.) You can also select JSON as your serialization mechanism, which will be less surprising, but will involve lots more manual preparation of objects involved in your background processing.
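    The "manual preparation" cost of JSON is easy to see with the standard library alone: rich objects are rejected outright, so what actually goes on the queue is usually a primary key or other plain data. The `Job` class here is a stand-in for an ORM model instance.

```python
import json

class Job:
    """Stand-in for an ORM model instance."""
    def __init__(self, pk):
        self.pk = pk

job = Job(42)
try:
    json.dumps(job)          # rich objects don't serialize to JSON
except TypeError as exc:
    print("not JSON-serializable:", exc)

# What crosses the queue instead: plain data the consumer re-hydrates itself.
message = json.dumps({"job_id": job.pk})
print(message)  # -> {"job_id": 42}
```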

    With Twisted you can just toss your work to be executed by the same process, sharing whatever objects make sense to share (although hopefully being careful to avoid accessing them from your Django request and your Twisted thread at the same time). There's nothing additional to monitor or manage or configure; everything just happens in your Python process.
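    The "being careful" caveat above usually comes down to a small locking discipline around anything both the Django request thread and the Twisted thread touch. A minimal sketch with a plain `threading.Lock` (the class and method names are illustrative):

```python
import threading

class SharedConnection:
    """An object shared between the request thread and the reactor thread."""
    def __init__(self):
        self._lock = threading.Lock()
        self._sent = []

    def send(self, item):
        # Only one thread may mutate the shared state at a time.
        with self._lock:
            self._sent.append(item)
            return len(self._sent)

conn = SharedConnection()
print(conn.send("job-1"))  # -> 1
```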

    Basically, the Celery approach is more work, but also has additional advantages. By serializing all your work out to external state, you can make your workers resilient to crashes, so when your Django request completes, you know that somebody is going to handle that work eventually. The queueing system you select may have features to manage backpressure, load spikes, and other functionality which might provide a useful control plane for your background work.

    Using Crochet, on the other hand, is almost free. It doesn't impose any additional operational constraints beyond getting Twisted installed, and there are no moving parts which may cause your system to partially fail: your message queueing system can't go down, since there is no such system; you're just calling some functions. It can also let you hold on to objects which may be tricky to serialize, like connections to outgoing systems, or credentials that you don't want to store in plain text in your queueing system. However, if you want any tooling around monitoring and managing the work as it goes from your web front end to your task-running backend, you'll need to implement it yourself.