I am trying to understand how I can send (produce) a large batch of records to a Kafka topic from Spark.
From the docs I can see that Spark tries to reuse the same producer across tasks on the same worker. When sending a lot of records at once, the network will be a bottleneck (as will memory, since Kafka buffers records before sending them). So I am wondering what the best configuration is to improve network usage.
Let's say the options I have for 1 and 2 are as follows (from Databricks):
To make better use of network I/O, which is the best choice?
My current thinking, though I am not sure, which is why I am asking here: although from a CPU point of view (expensive computation jobs) option 1) would be better (more concurrency and less shuffling), from a network I/O point of view I would rather use option 2), even if I end up with fewer cores overall.
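For reference, the producer-side knobs I have in mind look roughly like the sketch below. As far as I understand, the Kafka sink forwards any option prefixed with `kafka.` to the underlying producer; the broker addresses, topic name, and values here are placeholders and guesses, not tuned settings:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Stand-in for whatever large batch of records needs to be produced.
val df = Seq(("k1", "payload-1"), ("k2", "payload-2")).toDF("key", "value")

df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") // placeholder brokers
  .option("topic", "my-topic")                                    // placeholder topic
  // `kafka.`-prefixed options are passed straight to the Kafka producer:
  .option("kafka.compression.type", "lz4")    // fewer bytes over the wire
  .option("kafka.linger.ms", "50")            // wait briefly so batches fill up
  .option("kafka.batch.size", "262144")       // 256 KB batches vs. the 16 KB default
  .option("kafka.buffer.memory", "67108864")  // 64 MB producer-side buffer
  .save()
```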
Appreciate any input on this.
Thank you all.
The best solution is to have more workers to achieve parallelism (scale horizontally). The DataFrame has to be written to Kafka using Structured Streaming with Kafka as the sink, as explained here: https://docs.databricks.com/spark/latest/structured-streaming/kafka.html (if you don't want a persistent stream, you can always use the Trigger.Once option). Additionally, you can assume that 1 DataFrame partition = 1 CPU core, so you can optimize along that axis as well (although in streaming Databricks usually handles this automatically).
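A minimal sketch of what I mean, using the built-in rate source as a stand-in for your data and Trigger.Once so the query processes what is available and then stops (brokers, topic, and checkpoint path are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().getOrCreate()

// Stand-in streaming source; replace with your real source (files, Delta, etc.).
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "1000")
  .load()

events
  .selectExpr("CAST(value AS STRING) AS value") // the Kafka sink needs a string/binary `value` column
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")     // placeholder
  .option("topic", "my-topic")                           // placeholder
  .option("checkpointLocation", "/tmp/kafka-checkpoint") // placeholder path
  .trigger(Trigger.Once())  // run one batch over the available data, then stop
  .start()
  .awaitTermination()
```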
On the Kafka side, I guess it would be good to have a number of partitions/brokers similar to the number of Spark/Databricks workers.
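If you want to line the two up, repartitioning the DataFrame to the topic's partition count before writing is one way to do it (df is whatever DataFrame you are producing, and the partition count here is an assumption, adjust it to your actual topic):

```scala
// Assumption: the target topic has 16 partitions; matching the DataFrame's
// partition count to it gives one writing task per Kafka partition.
val numTopicPartitions = 16

df.repartition(numTopicPartitions)
  .selectExpr("CAST(value AS STRING) AS value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("topic", "my-topic")                       // placeholder
  .save()
```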