apache-kafka kafka-consumer-api kafka-producer-api partition kafka-topic

Ordering guarantees in Kafka in case of multiple threads

As far as I understand, both the Kafka producer and the Kafka consumer have to use a single thread per topic-partition if we want to write and read records in order. Am I right, or can they use multiple threads in such situations?

Solution

Ordering can be achieved in Kafka in both single-threaded and multithreaded environments.

Single broker / single partition -> single-threaded consumer model

Kafka guarantees message order within a single partition. With only one partition, however, parallelism and load balancing are difficult to achieve. Note that in this case only one thread accesses the topic partition, so ordering is always preserved.

Multiple brokers / multiple partitions -> multithreaded consumer model (consumer groups with more than one consumer)

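In this case we assume that the topic has multiple partitions and that, within each consumer group, each partition is handled by exactly one consumer (more precisely, by a single thread). Running several such consumers side by side is what makes this model multithreaded. Order is still preserved per partition, but not across partitions.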
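The one-thread-per-partition idea can be illustrated with a toy simulation. This is only a sketch: the partitions are plain in-memory lists rather than real Kafka partitions, and the class and method names (`PerPartitionThreads`, `consumeAll`) are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model: each inner list stands in for one partition (assumption, not the Kafka API).
class PerPartitionThreads {

    // Reads every partition on its own thread and returns what each thread saw, in order.
    static List<List<String>> consumeAll(List<List<String>> partitions) {
        List<List<String>> seen = new ArrayList<>();
        for (int i = 0; i < partitions.size(); i++) {
            seen.add(Collections.synchronizedList(new ArrayList<>()));
        }
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        for (int p = 0; p < partitions.size(); p++) {
            final int partition = p;
            // One task per partition: no other thread ever touches this partition.
            pool.submit(() -> {
                for (String record : partitions.get(partition)) {
                    seen.get(partition).add(record);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
            List.of("a1", "a2", "a3"),
            List.of("b1", "b2"));
        // Each partition comes back in its original order, whatever the thread timing was.
        System.out.println(consumeAll(partitions));
    }
}
```

The threads run concurrently, so nothing is said about whether "a1" is processed before or after "b1"; only the order inside each partition is fixed.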

There are three ways to assign messages to partitions, and the choice determines which messages stay in order relative to each other. Each method has its own pros and cons.

Method 1: Round Robin or Spraying
Method 2: Hashing Key Partition
Method 3: Custom Partitioner

Round Robin or Spraying (Default)
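In this method the partitioner sends messages to all partitions in round-robin fashion, which balances the load across brokers and keeps any single partition from being overloaded. Parallelism and load balancing are achieved, but overall order is lost: order is maintained only within each partition, and related messages can land on different partitions. This is the default when a record has no key (since Kafka 2.4 the default partitioner fills a batch for one partition before moving to the next, rather than switching on every record), and it is not suitable for business scenarios that need related messages processed in order.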
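A small sketch of why spraying loses the order of related messages. It assumes a strict per-record rotation (message i goes to partition i % numPartitions) and uses made-up names (`RoundRobinSketch`, `spray`); it is not the Kafka client's code.

```java
import java.util.ArrayList;
import java.util.List;

class RoundRobinSketch {

    // Strict rotation: message i goes to partition i % numPartitions.
    static List<List<String>> spray(List<String> messages, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < messages.size(); i++) {
            partitions.get(i % numPartitions).add(messages.get(i));
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Events for two orders, o1 and o2, in the order they were produced.
        List<String> sent = List.of(
            "o1:created", "o2:created", "o1:paid", "o2:paid", "o1:shipped");
        System.out.println(spray(sent, 3));
        // prints [[o1:created, o2:paid], [o2:created, o1:shipped], [o1:paid]]
    }
}
```

The three events of o1 end up on three different partitions, so three independent consumers could process "o1:shipped" before "o1:paid".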

To overcome this limitation and keep related messages in order, let’s try another approach.

Hashing Key Partition
In this method we create a ProducerRecord with a message key for each message sent to the topic. The default partitioner hashes the key, so all messages with the same key go to the same partition and stay in order relative to each other. This is the easiest and most common approach. Hive bucketing works on the same principle. The partition is the hash of the key modulo the number of partitions:
Hash(Key) % Number of partitions -> Partition number
In other words, the key determines the partition, so the producer always sends messages with the same key to the same partition, which maintains their order. The drawback is that the hash takes no account of how much traffic each key carries, so a few very active keys can overload a single partition.
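The formula above can be sketched as follows. Note the assumption: `String.hashCode()` is used here as a stand-in hash, whereas the Kafka client hashes the serialized key bytes with its own function, so the partition numbers printed by this sketch will not match a real cluster. The names `KeyHashSketch`, `partitionFor` and `load` are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class KeyHashSketch {

    // Hash(Key) % number of partitions, with String.hashCode() as a stand-in hash.
    // floorMod keeps the result non-negative even when hashCode() is negative.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts how many records land on each partition, to make skew visible.
    static Map<Integer, Integer> load(List<String> keys, int numPartitions) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(partitionFor(key, numPartitions), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // One very active customer and one quiet one.
        List<String> keys = List.of(
            "customer-1", "customer-1", "customer-1", "customer-1", "customer-2");
        // The same key always maps to the same partition, so its records stay ordered.
        System.out.println(partitionFor("customer-1", 4) == partitionFor("customer-1", 4));
        // All four customer-1 records pile up on one partition: that is the skew.
        System.out.println(load(keys, 4));
    }
}
```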

Custom Partitioner
We can write our own business logic to decide which message is sent to which partition. With this approach we can order messages according to our business logic and still achieve parallelism.
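As an illustration, here is the kind of rule a custom partitioner might apply. The rule itself is invented for the example: keys starting with "vip-" get partition 0 to themselves, and all other keys are hashed over the remaining partitions. In a real producer this logic would go into a class implementing `org.apache.kafka.clients.producer.Partitioner`, registered through the `partitioner.class` producer setting; the sketch below keeps only the decision function so it runs without the Kafka client on the classpath.

```java
class CustomPartitionSketch {

    // Hypothetical business rule, not part of Kafka:
    //   "vip-" keys -> partition 0 (reserved)
    //   other keys  -> hashed over partitions 1 .. numPartitions - 1
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions < 2) {
            return 0; // nothing to reserve when there is only one partition
        }
        if (key.startsWith("vip-")) {
            return 0;
        }
        // String.hashCode() is a stand-in hash; floorMod keeps the result non-negative.
        return 1 + Math.floorMod(key.hashCode(), numPartitions - 1);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("vip-42", 6));      // prints 0
        System.out.println(partitionFor("customer-7", 6));  // some partition from 1 to 5
    }
}
```

Because the function is deterministic per key, messages with the same key still land on one partition and keep their order, while the reserved partition isolates the traffic chosen by the business rule.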

For more details, see the article below:

https://medium.com/latentview-data-services/how-to-use-apache-kafka-to-guarantee-message-ordering-ac2d00da6c22

Also note that everything above describes partition-level parallelism.

There is also a newer strategy called consumer-level parallelism. I have not read up on it yet, but you can find details at the Confluent link below:

https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/