Tags: apache-kafka, kafka-consumer-api, kafka-producer-api, partition, kafka-topic

Ordering guarantees in Kafka in case of multiple threads


As far as I understand, both the Kafka producer and the consumer have to use a single thread per topic-partition if we want to write and read records in order. Am I right, or can they use multiple threads in such situations?


Solution

  • Ordering can be achieved in Kafka in both single-threaded and multithreaded environments, because Kafka guarantees order within a partition rather than across a whole topic.

    single broker/single partition -> Single thread based consumer model


    Message ordering in Kafka is guaranteed within a single partition. With only one partition, however, parallelism and load balancing are difficult to achieve. Note that in this case only one thread reads the topic partition, so records are always processed in the order they were written.

    multiple brokers/multiple partitions -> Multithreaded consumer model (consumer groups with more than one consumer)


    In this case the topic has multiple partitions, and within each consumer group every partition is handled by exactly one consumer (more precisely, one thread). The consumers run in parallel, which is what makes the model multithreaded, while order is still preserved within each partition.
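
    The one-thread-per-partition model can be illustrated with a small in-memory simulation. This is only a sketch: the `partitions` dictionary and the `consume` function are hypothetical stand-ins for a topic and a consumer, and no Kafka client is involved.

```python
import threading

# Hypothetical in-memory stand-in for a topic with three partitions.
# Each partition is an ordered log of (key, value) records.
partitions = {
    0: [("a", 1), ("a", 2), ("a", 3)],
    1: [("b", 1), ("b", 2)],
    2: [("c", 1), ("c", 2), ("c", 3)],
}

# What each consumer thread observed, in the order it observed it.
processed = {pid: [] for pid in partitions}


def consume(partition_id, records):
    """One thread owns exactly one partition and reads it sequentially."""
    for record in records:
        processed[partition_id].append(record)


threads = [
    threading.Thread(target=consume, args=(pid, recs))
    for pid, recs in partitions.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The threads ran concurrently, yet each partition was read in log order.
for pid, recs in partitions.items():
    print(pid, processed[pid] == recs)
```

    Because no two threads ever share a partition, nothing can reorder the records inside one partition; only the relative timing between partitions is undefined.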

    There are three ways a producer can distribute messages across partitions, and they differ in how much ordering they preserve. Each method has its own pros and cons.

    Method 1: Round Robin or Spraying
    Method 2: Hashing Key Partition
    Method 3: Custom Partitioner

    Round Robin or Spraying (Default)
    In this method, the partitioner sends messages to all partitions in round-robin fashion, which balances the load so that no partition is overloaded. It achieves parallelism and load balancing, but it does not maintain overall order; only the order within each partition is maintained. This is the default behaviour for records without a key (newer client versions batch keyless records with a "sticky" strategy instead of strict round robin, with the same consequence for ordering), and it is not suitable for business scenarios where related messages must be processed in sequence.
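
    A minimal simulation of round-robin assignment shows why overall order is lost. The function name `round_robin_assign` and the message strings are made up for illustration; this is not the Kafka client's partitioner.

```python
from itertools import cycle

NUM_PARTITIONS = 3  # assumed partition count for the example


def round_robin_assign(messages, num_partitions=NUM_PARTITIONS):
    """Spread messages over partitions in turn, ignoring any key."""
    partitions = [[] for _ in range(num_partitions)]
    targets = cycle(range(num_partitions))
    for msg in messages:
        partitions[next(targets)].append(msg)
    return partitions


# Six events that all belong to the same (hypothetical) order.
messages = [f"order-42:event-{i}" for i in range(6)]

for pid, log in enumerate(round_robin_assign(messages)):
    print(pid, log)
```

    Events 0 and 3 land in partition 0, events 1 and 4 in partition 1, and events 2 and 5 in partition 2. Each partition is internally ordered, but three independent consumers could process event 2 before event 1.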

    To maintain message ordering in such scenarios, let’s try another approach.

    Hashing Key Partition
    In this method we create a ProducerRecord that specifies a message key for each message sent to the topic. The default partitioner uses the hash of the key so that all messages with the same key go to the same partition, which keeps them in order relative to each other. This is the easiest and most common approach, and it is the same idea that Hive bucketing uses. The partition is chosen with a modulo operation on the hash:
    Hash(Key) % Number of partitions -> Partition number
    The key therefore determines the partition to which the producer always sends the message, which maintains the order for that key. The drawback is that the partition depends only on the hash of the key, not on load: if a few keys carry most of the traffic, their data piles up on a single partition.
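
    The formula above can be sketched as follows. Note the assumption: `zlib.crc32` is used here only as a convenient deterministic hash from the standard library, whereas Kafka's default partitioner hashes the serialized key with murmur2. The names `partition_for` and `events` are hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Hash(key) % number of partitions (crc32 as a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical events; customer-1 is a "hot" key.
events = [
    ("customer-1", "created"),
    ("customer-2", "created"),
    ("customer-1", "paid"),
    ("customer-1", "shipped"),
]

# Same key -> same partition, so per-key order is preserved.
for key, value in events:
    print(key, value, "-> partition", partition_for(key, 4))

# Load per partition: the hot key concentrates records on one partition.
load = Counter(partition_for(key, 4) for key, _ in events)
print(load)
```

    All three `customer-1` events map to one partition, which is exactly what preserves their order and also what causes the skew described above.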

    Custom Partitioner
    We can write our own business logic to decide which message needs to be sent to which partition. With this approach, we can order messages according to our business logic and achieve parallelism at the same time.
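
    As a rough sketch of such business logic, the function below reserves one partition for a class of keys and hashes everything else over the remaining partitions. The "vip-" prefix rule, the constants, and the name `custom_partition` are invented for illustration. In the Java client this kind of logic would live in a class implementing the `Partitioner` interface, registered through the `partitioner.class` producer setting.

```python
import zlib

NUM_PARTITIONS = 4      # assumed partition count
PRIORITY_PARTITION = 0  # hypothetical: reserved for "vip-" keys


def custom_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Route priority keys to a dedicated partition, hash the rest."""
    if key.startswith("vip-"):
        return PRIORITY_PARTITION
    # Spread all other keys over partitions 1..num_partitions-1.
    # crc32 is a stand-in hash; the same key always gets the same partition.
    return 1 + zlib.crc32(key.encode("utf-8")) % (num_partitions - 1)


for key in ["vip-7", "customer-1", "customer-2", "vip-9"]:
    print(key, "-> partition", custom_partition(key))
```

    Every key still maps to exactly one partition, so per-key order is kept, while the routing rule itself is free to reflect business priorities.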

    For more details, see the article below:

    https://medium.com/latentview-data-services/how-to-use-apache-kafka-to-guarantee-message-ordering-ac2d00da6c22

    Also, please note that everything above describes partition-level parallelism.

    There is also a newer strategy called consumer-level parallelism. I have not read about it in detail, but you can find more at the Confluent link below:

    https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/
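
    One general way to process a single partition with several threads while still keeping order per key is to pin each key to one worker. The sketch below illustrates only that idea, under stated assumptions: it is not the API of the client described at the link, and `records`, `worker`, and `NUM_WORKERS` are hypothetical names.

```python
import queue
import threading
import zlib
from collections import defaultdict

NUM_WORKERS = 3  # assumed size of the worker pool

# Hypothetical records read, in order, from ONE partition.
records = [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 2), ("a", 3)]

work_queues = [queue.Queue() for _ in range(NUM_WORKERS)]
seen = defaultdict(list)  # key -> values in the order they were processed
lock = threading.Lock()


def worker(q):
    """Drain one FIFO queue until the None sentinel arrives."""
    while True:
        item = q.get()
        if item is None:
            break
        key, value = item
        with lock:
            seen[key].append(value)


threads = [threading.Thread(target=worker, args=(q,)) for q in work_queues]
for t in threads:
    t.start()

# Pin every key to one worker so records for that key stay in FIFO order.
for key, value in records:
    idx = zlib.crc32(key.encode("utf-8")) % NUM_WORKERS
    work_queues[idx].put((key, value))

for q in work_queues:
    q.put(None)  # tell each worker to stop
for t in threads:
    t.join()

print(dict(seen))
```

    Records with different keys may be handled concurrently, but all records for one key pass through the same queue and the same thread, so their relative order is unchanged.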