tokio::try_join! doesn't return the Err variant when one of the tasks returns Err?

I'm having trouble understanding the interaction between tokio::try_join! and tasks running inside tokio::spawn that return an Err. When I run the following sample:

use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let h1 = tokio::spawn(async {
        sleep(Duration::from_millis(100)).await;
        // 1/0; commented for now
        let v: Result<i32, ()> = Err(());
        v
    });

    let h2 = tokio::spawn(async {
        sleep(Duration::from_millis(500)).await;
        println!("h2 didn't get canceled");
        let v: Result<i32, ()> = Ok(2);
        v
    });

    match tokio::try_join!(h1, h2) {
        Ok((first, second)) => {
            println!("try_join was successful, got {:?} and {:?}", first, second);
        }
        Err(err) => {
            println!("try_join had an error: {:?}", err);
        }
    }
}

it prints

h2 didn't get canceled
try_join was successful, got Err(()) and Ok(2)

However, I expected it to print something like what happens when I uncomment the division by zero inside h1:

thread 'tokio-runtime-worker' panicked at 'attempt to divide by zero', src/bin/select-test.rs:7:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
try_join had an error: JoinError::Panic(...)

The try_join! docs say

The try_join! macro returns when all branches return with Ok or when the first branch returns with Err.

However, in the example that I posted, h1 does return Err but try_join! executes the Ok variant. Further, h2 doesn't get cancelled: it runs to completion even though h1 had already failed hundreds of milliseconds before. I can't tell whether this contradicts the docs or not. Also, I can't seem to achieve what I was trying to do, which was to get h2 canceled when h1 returns Err.

After more trial and error, I found that when I remove the tokio::spawn from both h1 and h2, try_join! does execute in the way I expected and calls the Err variant. Still, I don't understand why this makes a difference.

Can anyone provide a bit more info into why this is the behavior? Do I need to remove tokio::spawn and forfeit parallel execution between h1 and h2 if I want h2 to be canceled when h1 returns an error?

Solution

First of all, you have to understand how futures work. The Rust async book is a good place to start.

Unlike threads, which make progress on their own, a future must be polled. If it is not polled, it will not do anything. So there are two ways to do that:

As part of another async function:

async fn foo() {
    // do something
}

async fn bar() {
    foo().await; // here foo() is being polled
}

The problem with that approach is that someone needs to drive the future. Here bar() is driving foo(), but it won't do anything unless someone drives bar() (i.e. calls its poll() method).
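To make the "a future must be polled" point concrete, here is a minimal sketch that uses only the standard library. The names NoopWaker and lazy_demo are made up for this illustration; a real runtime such as tokio supplies a proper waker and calls poll() for you.

```rust
use std::future::Future;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical waker that does nothing; enough to poll a future by hand.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Returns (ran_before_poll, ran_after_poll).
fn lazy_demo() -> (bool, bool) {
    let ran = Arc::new(AtomicBool::new(false));
    let flag = ran.clone();

    // Creating the future runs none of its body.
    let mut fut = Box::pin(async move {
        flag.store(true, Ordering::SeqCst);
    });
    let before = ran.load(Ordering::SeqCst);

    // Only an explicit poll() makes the future progress.
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let state = fut.as_mut().poll(&mut cx);
    assert!(matches!(state, Poll::Ready(())));

    (before, ran.load(Ordering::SeqCst))
}

fn main() {
    let (before, after) = lazy_demo();
    println!("ran before poll: {before}, ran after poll: {after}");
}
```

The flag is still false after the future is created and only flips once poll() is called.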

Spawning a task

You can use the spawn() method to hand over the responsibility for polling the future to the runtime. When you do that, you no longer need to (and no longer can) call .await on the original future yourself: it has been moved into the task, and the task scheduler polls it for you. What you get back is a JoinHandle.

Back to the problem

So why does it not work in your case?

let h1 = tokio::spawn(async {...});
let h2 = tokio::spawn(async {...});

It does not work because you are spawning the tasks. Think of it as if you were starting two threads (although you are not) that work independently of one another. You are no longer responsible for polling the futures; the runtime does that for you. Those two tasks will run to completion regardless of whether their join handles are being polled.

I guess your confusion comes from the join handles h1 and h2. Yes, you can .await those, but they only tell you whether the task has finished; they do not drive the actual task, the tokio scheduler does. Think of them like the join handle of a thread: it does not matter whether you .join() the thread or not, it still runs in the background. That's why h2 still runs to completion: the task is still being polled by the scheduler, and the try_join!() macro is not driving it. It also explains the Ok branch. Awaiting a JoinHandle<T> yields Result<T, JoinError>, so h1 resolves to Ok(Err(())). try_join!() only looks at that outer Result, which is Err only when the task panics or is cancelled, as in your division-by-zero run.

When you do not spawn them, try_join!() is driving the tasks. It is calling .poll() on the actual futures, so when task-1 completes with Err, it stops calling .poll() on task-2 and drops it, thus effectively cancelling it.

TLDR: when spawned, try_join!() is driving the join-handles, while in the other case it is driving the futures themselves.

Your other question

Do I need to remove tokio::spawn and forfeit parallel execution between h1 and h2 if I want h2 to be canceled when h1 returns an error?

No. You can use JoinHandle::abort() to cancel the second task manually.

To answer your question in the comments:

Now, this raises a second question (and I think the source of my confusion): even when using tokio::spawn, select! does cancel h2 (i.e. no need for abort(), and h2 doesn't print the h2 didn't get canceled line). This seems to be what's weird to me: while select and join seem kind of similar, their behavior is the opposite.

The problem here is that your application reaches the end of main(), so the whole runtime shuts down and every task that is still running gets cancelled. If you add a brief sleep() at the end, you will see your message:

tokio::select! {
    _ = h1 => println!("H1"),
    _ = h2 => println!("H2"),
}

sleep(Duration::from_secs(2)).await;

Which results in:

H1
h2 didn't get canceled