I get a pickle error whenever I try to pass the consumer object as an argument to a function that is being submitted to dask. I'm using confluent_kafka to create the consumer, but I believe the same happens when using kafka-python. Is there any way to solve this?
Thanks.
You might be interested in trying streamz, which integrates with Kafka as well as with dask.
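As a rough illustration of that route, here is a minimal local sketch using streamz's Kafka source. The broker address, group id and topic name are assumptions, and the exact signature of `Stream.from_kafka` can vary between streamz versions, so check the docs for the release you have; batches can also be handed off to a dask cluster with `.scatter()`/`.gather()` once the basic pipeline works.

```python
from streamz import Stream

consumer_conf = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "streamz-demo",             # assumed consumer group
}

# Poll the topic and push each raw message value into the stream.
source = Stream.from_kafka(["my-topic"], consumer_conf)   # assumed topic name
source.map(lambda value: value.decode()).sink(print)
source.start()
```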
You may also be interested in this blog by RapidsAI, showing how many Kafka events per second can be processed with the help of GPUs.
If you are not using streamz, the consumer needs to be recreated on each worker, either as some per-worker global or within each task (the latter incurring connection overhead on every call); see the sketch below for the per-task version.
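A minimal sketch of the per-task variant: only plain, picklable arguments (a topic name and a config dict) cross the wire, and the consumer itself is built inside the task on the worker. The broker address, group id and topic name are assumptions for illustration.

```python
from dask.distributed import Client
from confluent_kafka import Consumer

# Plain dict -- pickles fine, unlike the Consumer object itself.
CONSUMER_CONF = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "dask-demo",                # assumed consumer group
    "auto.offset.reset": "earliest",
}

def consume_some(topic, max_messages=100):
    # Build the consumer inside the task, on the worker, so nothing
    # un-picklable is ever passed to client.submit.
    consumer = Consumer(CONSUMER_CONF)
    consumer.subscribe([topic])
    out = []
    try:
        while len(out) < max_messages:
            msg = consumer.poll(1.0)
            if msg is None:
                break
            if msg.error():
                continue
            out.append(msg.value())
    finally:
        consumer.close()
    return out

if __name__ == "__main__":
    client = Client()  # local cluster, just for illustration
    future = client.submit(consume_some, "my-topic")  # assumed topic name
    print(len(future.result()), "messages")
```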
Explanation of the problem: the rdkafka object is essentially a reference to a C struct holding internal state, threads and open sockets. Python does not know how to "pickle" (serialise) such a thing, which is how it would need to be transferred to another process, i.e. your dask worker. You could in theory create your own serialisation for dask (see here), but in practice you should instead create new consumer clients for each worker, along the lines of the sketch below.
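A hedged sketch of the "one consumer per worker" pattern: the consumer is created lazily the first time a task runs on a given worker and cached on the worker object, so later tasks on that worker reuse it instead of reconnecting. Stashing an attribute on the object returned by `get_worker()` is an informal convention, and the broker, group id and topic names are assumptions.

```python
from dask.distributed import Client, get_worker
from confluent_kafka import Consumer

def _worker_consumer(topic):
    worker = get_worker()  # the Worker this task is running on
    consumer = getattr(worker, "_kafka_consumer", None)
    if consumer is None:
        consumer = Consumer({
            "bootstrap.servers": "localhost:9092",  # assumed broker address
            "group.id": "dask-workers",             # assumed consumer group
            "auto.offset.reset": "earliest",
        })
        consumer.subscribe([topic])
        worker._kafka_consumer = consumer  # cache for later tasks on this worker
    return consumer

def poll_batch(topic, n=10):
    consumer = _worker_consumer(topic)
    out = []
    for _ in range(n):
        msg = consumer.poll(1.0)
        if msg is None:
            break
        if msg.error() is None:
            out.append(msg.value())
    return out

if __name__ == "__main__":
    client = Client()  # local cluster, just for illustration
    print(client.submit(poll_batch, "my-topic").result())  # assumed topic name
```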