I am trying to understand how I can send (produce) a large batch of records to a Kafka topic from Spark.
From the docs I can see that Spark tries to reuse the same producer across tasks on the same worker. When sending a lot of records at once, the network will be a bottleneck (as will memory, since Kafka buffers records before sending them). So I am wondering what the best configuration is to improve network usage.
Let's say the options I have for 1 and 2 are as follows (from Databricks):
To make better use of network I/O, which is the best choice?
My current thinking, though I am not sure, which is why I am asking here: although from a CPU point of view (expensive computation jobs) option 1) would be better (more concurrency and less shuffling), from a network I/O point of view I would rather use option 2), even if I end up with fewer cores overall.
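For reference, the producer-side knobs I have in mind look roughly like the sketch below. As far as I understand, the Kafka sink forwards any option prefixed with `kafka.` to the underlying producer; the broker addresses, topic name, and values here are placeholders and guesses, not tuned settings:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Stand-in for whatever large batch of records needs to be produced.
val df = Seq(("k1", "payload-1"), ("k2", "payload-2")).toDF("key", "value")

df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") // placeholder brokers
  .option("topic", "my-topic")                                    // placeholder topic
  // `kafka.`-prefixed options are passed straight to the Kafka producer:
  .option("kafka.compression.type", "lz4")    // fewer bytes over the wire
  .option("kafka.linger.ms", "50")            // wait briefly so batches fill up
  .option("kafka.batch.size", "262144")       // 256 KB batches vs. the 16 KB default
  .option("kafka.buffer.memory", "67108864")  // 64 MB producer-side buffer
  .save()
```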
Appreciate any input on this.
Thank you all.
The best solution is to have more workers to achieve parallelism (scale horizontally). The DataFrame has to be written to Kafka using Structured Streaming with Kafka as the sink, as explained here: https://docs.databricks.com/spark/latest/structured-streaming/kafka.html (if you don't want a persistent stream, you can always use the Trigger.Once option). Additionally, you can assume that 1 DataFrame partition = 1 CPU core, so you can optimize along that axis as well (although in streaming Databricks usually handles this automatically).
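A minimal sketch of what I mean, using the built-in rate source as a stand-in for your data and Trigger.Once so the query processes what is available and then stops (brokers, topic, and checkpoint path are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().getOrCreate()

// Stand-in streaming source; replace with your real source (files, Delta, etc.).
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "1000")
  .load()

events
  .selectExpr("CAST(value AS STRING) AS value") // the Kafka sink needs a string/binary `value` column
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")     // placeholder
  .option("topic", "my-topic")                           // placeholder
  .option("checkpointLocation", "/tmp/kafka-checkpoint") // placeholder path
  .trigger(Trigger.Once())  // run one batch over the available data, then stop
  .start()
  .awaitTermination()
```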
On the Kafka side, I guess it would be good to have a number of partitions/brokers similar to the number of Spark/Databricks workers.
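If you want to line the two up, repartitioning the DataFrame to the topic's partition count before writing is one way to do it (df is whatever DataFrame you are producing, and the partition count here is an assumption, adjust it to your actual topic):

```scala
// Assumption: the target topic has 16 partitions; matching the DataFrame's
// partition count to it gives one writing task per Kafka partition.
val numTopicPartitions = 16

df.repartition(numTopicPartitions)
  .selectExpr("CAST(value AS STRING) AS value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("topic", "my-topic")                       // placeholder
  .save()
```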