How can I make FutureProducer perform close to ThreadedProducer in Rust rdkafka?

I'm just playing around with the examples. I tried using FutureProducer with tokio::spawn, and I'm getting about 11 ms per produce: 1000 messages in 11000 ms (11 seconds).

ThreadedProducer, by contrast, produced 1000000 (1 million) messages in about 4.5 seconds (dev build) and 2.6 seconds (with --release). That is a huge difference between the two, so maybe I missed something or I'm doing something wrong. Why use FutureProducer if the speed gap is this large? Maybe someone can shed some light on this and help me understand FutureProducer.

The Kafka topic name is "my-topic" and it has 3 partitions.

Maybe my code is not written in a way that suits FutureProducer. I need to produce 1000000 messages in less than 10 seconds using FutureProducer.
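To put the numbers above side by side, here is a small sketch that converts each measurement into messages per second. The helper name `throughput_per_sec` is made up for illustration; the inputs are the figures quoted in this question (the ThreadedProducer figure is the --release one).

```rust
/// Messages per second for `messages` sent in `elapsed_ms` milliseconds.
/// (Hypothetical helper, only used to compare the figures in this question.)
fn throughput_per_sec(messages: u64, elapsed_ms: u64) -> f64 {
    messages as f64 / (elapsed_ms as f64 / 1000.0)
}

fn main() {
    // FutureProducer attempt: 1000 messages in 11000 ms.
    let future_rate = throughput_per_sec(1_000, 11_000);
    // ThreadedProducer with --release: 1000000 messages in about 2600 ms.
    let threaded_rate = throughput_per_sec(1_000_000, 2_600);
    // Target: 1000000 messages in 10000 ms.
    let target_rate = throughput_per_sec(1_000_000, 10_000);

    println!("FutureProducer:   {:.0} msg/s", future_rate);
    println!("ThreadedProducer: {:.0} msg/s", threaded_rate);
    println!("Target:           {:.0} msg/s", target_rate);
    println!("Speed-up needed:  {:.0}x", target_rate / future_rate);
}
```

So the FutureProducer attempt would have to get roughly three orders of magnitude faster to reach the target, which suggests a problem in how the sends are driven rather than a small tuning issue.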

My attempts are in the following gists (I updated this question to add new ones):

Note: After writing my question, I kept trying different ideas until I succeeded on the 7th attempt.

1- spawn blocking: https://gist.github.com/arkanmgerges/cf1e43ce0b819ebdd1b383d6b51bb049

2- threaded producer: https://gist.github.com/arkanmgerges/15011348ef3f169226f9a47db78c48bd

3- future producer: https://gist.github.com/arkanmgerges/181623f380d05d07086398385609e82e

4- OS threads with base producer: https://gist.github.com/arkanmgerges/1e953207d5a46d15754d58f17f573914

5- OS threads with future producer: https://gist.github.com/arkanmgerges/2f0bb4ac67d91af0d8519e262caed52d

6- OS threads with spawned tokio tasks for the future producer: https://gist.github.com/arkanmgerges/7c696fef6b397b9235564f1266443726

7- tokio multithreading using #[tokio::main] with FutureProducer: https://gist.github.com/arkanmgerges/24e1a1831d62f9c5e079ee06e96a6329

Solution

In my 5th example, I needed to use OS threads (thanks to the discussion with @BlackBeans). Inside each OS thread I create a Tokio runtime with 4 worker threads, and the OS thread blocks on that runtime. The example uses 100 OS threads, each with its own 4-worker Tokio runtime.

Each OS thread produces 10000 messages. The code is not optimized, and I ran it as a dev build.
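The work split described above (100 OS threads, 10000 messages each) can be sketched with the standard library alone. This is not the code from the gist: `send_stub` and `fan_out` are hypothetical names, the stub only counts messages instead of calling a real producer, and the per-thread Tokio runtime is left out so the example runs without a broker.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for one produce call; it only counts the message.
fn send_stub(sent: &AtomicU64) {
    sent.fetch_add(1, Ordering::Relaxed);
}

/// Spawns `threads` OS threads, each sending `per_thread` messages,
/// waits for all of them, and returns the total number of messages sent.
fn fan_out(threads: u64, per_thread: u64) -> u64 {
    let sent = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let sent = Arc::clone(&sent);
            thread::spawn(move || {
                // Assumption: in the real attempt, this is where each thread
                // would build its own Tokio runtime and block on the sends.
                for _ in 0..per_thread {
                    send_stub(&sent);
                }
            })
        })
        .collect();

    // Joining every thread guarantees all increments are visible below.
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    sent.load(Ordering::Relaxed)
}

fn main() {
    // 100 OS threads x 10000 messages each, as in the 5th attempt.
    let total = fan_out(100, 10_000);
    println!("messages sent: {total}");
}
```

The point of the sketch is only the shape of the fan-out: many threads issue sends in parallel, and the main thread waits for all of them before measuring.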

In my 7th attempt I used #[tokio::main], which by default uses block_on. When I spawn a new task, it can be placed on another OS thread under the main scheduler (inside block_on); I verified this with a separate test using #[tokio::main]. This version produced 1 million messages in 2.93 seconds (dev build) and 2.29 seconds (release build).