Search code examples
rustrabbitmqrust-tokiorayon

Strategies for Cancelling Long-Running Rayon Tasks in Rust via RabbitMQ


I'm developing an executor in Rust to handle compute-intensive tasks as part of a larger web application. The executor receives jobs from RabbitMQ (using lapin) and processes various CPU-bound operations, which include some that internally utilize rayon. I'm aware that async can complicate rayon's usage. Post-computation or task cancellation, it communicates results or cancellation status back to RabbitMQ.

The operations in question are out of my control, consisting mainly of long-running, parallel graph algorithms. I'm exploring strategies to cancel these tasks reactively through a separate RabbitMQ cancellation queue without rewriting the underlying libraries due to their (to my understanding) non-compatibility with async patterns and the general lack of built-in task cancellation support for safety reasons.

I've prototyped an executor using tokio, only to recognize the potential issues with rayon after the fact. The current cancellation strategy is based on tokio oneshot channels, which inefficiently drop tasks without liberating CPU resources. I am looking for advice on crafting an idiomatic executor that enables effective task cancellation and conforms to Rust's safety guarantees.

Edit: I‘ll try to be more specific: Essentially, I am looking for a strategy to cancel (long-running, blocking, CPU-bound) tasks from another thread, effectively aborting the computation and freeing up the CPU. Gracefully signaling shutdown using a mechanism such as tokio‘s oneshot channel is not an option since I don‘t have control over the implementation of the long running operation.


Solution

  • The only sound way to stop a thread is for it to cooperate with that. There are multiple ways to do this, but they all involve something pervasive:

    • Rust async code may intentionally yield (return Poll::Pending from poll()) to give the executor or containing future a chance to decide not to poll it again.
    • You can consult an atomic flag or message channel and return or panic instead of continuing.
    • Finally, it is possible under some circumstances to arrange so that, in the event execution of the thread is terminated abruptly, this does not corrupt anything. However, this requires total control over all of the code that the thread executes — the thread must not call into any third-party libraries. The only systems I've heard of doing this successfully are “virtual machines” where the code is managed by an interpreter or JIT.

    So, given your constraints of calling into code you do not control, the most you can possibly do is let the thread continue until control returns to code you wrote (at which point you can panic!() to cleanly terminate the thread).

    Some ideas for how to proceed:

    • If the code you're calling itself uses Rayon, then perhaps you could contribute a cancellation feature to Rayon (or run a patched version) that would check for cancellation on each rayon::join or equivalent.
    • Remember that if the code calls any iterators you provide, those are also code you control that you can arrange to panic.)
    • Contribute a cancellation flag to the specific library you're calling.

    You may wonder: why does cancelling a thread corrupt state? Two examples:

    • The classic example that got Java to deprecate its Thread.stop() is the consideration of what happens to locks the thread is holding: if they remain locked, then whatever the locks are shared with will be itself blocked forever, creating a contagion of hung threads; if they are unlocked, then invalid half-edited state may be revealed. (In Rust, we have “poisoning” of locks, but this does not address e.g. possibly dangling pointers or other kinds of invalid values; only application-level situations that could arise from a panic unwind.)

    • Threads may borrow data from other threads' stacks; both Rayon and std::thread::scope() do this. Cancelling a thread abruptly would lead to use-after-free.

    Both of these things could potentially be addressed, by adding suitable checks to them or prohibiting them (e.g. using strictly channels and atomics, no locks), but you'd have to make sure that all of the code the thread executes obeys suitable restrictions, and Rust, being a language without a “VM” or “runtime”, does not impose such restrictions globally.