Search code examples
apache-sparkapache-kafkaspark-structured-streaming

Does the number of kafka partitions increase the speed of Spark writing to kafka?


When reading, Spark have a mapping 1:1 to kafka partitions, so, with more partitions we can leverage more parellelism to our job.

But does it apply when Spark is writing in kafka ? Writing the same dataset in one topic with 4 partitions is more fast than writing in a topic with 1 partition ?


Solution

  • Yes.

    If your topic has 1 partition means it is in one broker. So, If you increase producer rate for the topic, then that broker becomes busy. But if you have multiple partitions, your Kafka cluster shared those partitions into different brokers and those production rate shared within multiple brokers. So, Writing the same dataset in one topic with 4 partitions is more fast than writing in a topic with 1 partition.

    This not only production rate. In Kafka brokers, There is multiple processes like compactions, compressions, segmentations etc... So with number of messages, that work load becomes high. But with multiple partitions in multiple brokers, it will be distributed.

    However, you don’t necessarily want to use more partitions than needed because increasing partition count simultaneously increases the number of open server files and leads to increased replication latency.

    from kafka documentation

    Distribution The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance. Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.