Tags: hash, apache-kafka, key, partitioner

Does the default Kafka partitioner create hash key collisions?


I have a topic with 10 partitions, and I have generated events with 9 different keys: A, B, C, D, E, F, G, H, I.

I've observed messages doing this:

Partition 0 - (Message1, Key E), (Message2, Key I)
Partition 1 - (Message3, Key F)
.
.
Partition 7 - (Message4, Key A), (Message5, Key A)
Partition 8 - Empty
Partition 9 - Empty

There are 2 messages with different keys in the same partition, and there are empty partitions as well.

Is the default partitioner of Kafka creating collisions?

I am producing from one stream that is balanced across two default REST producers.

This is what I was expecting:

 Partition 0 - (Message1, Key E)
 Partition 1 - (Message3, Key F)
 .
 .
 Partition 7 - (Message4, Key A), (Message5, Key A)
 Partition 8 - (Message2, Key I)
 Partition 9 - Empty

Solution

  • Kafka's DefaultPartitioner uses the murmur2 hash algorithm on the producer client side to assign a partition to each message. With 10 partitions and only a single-digit number of keys, there is no guarantee that the keys will be uniformly distributed. The partition for each message is calculated independently of the others, so whether two keys collide is simply a matter of probability.

    EDIT:

    It is very unlikely that the murmur2 hash itself produces a collision (two different keys with the same hash value). However, the number of partitions in a Kafka topic is fixed - unlike the bucket array in Java's HashMap implementation, it cannot grow. The partitioning algorithm therefore takes the hash modulo the number of partitions. The exact formula is Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions.
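As a rough sketch of that formula (hypothetical code: it uses Java's built-in String.hashCode as a stand-in for Kafka's Utils.murmur2, so the exact partition numbers will differ from what Kafka produces):

```java
public class PartitionDemo {
    // Mirrors Kafka's Utils.toPositive: clears the sign bit so the
    // result of the modulo below is never negative.
    static int toPositive(int number) {
        return number & 0x7fffffff;
    }

    // NOTE: String.hashCode is a stand-in here -- Kafka really uses
    // murmur2 over the serialized key bytes. The mod-based partition
    // selection works the same way either way.
    static int partitionFor(String key, int numPartitions) {
        return toPositive(key.hashCode()) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 10;
        for (String key : new String[] {"A", "B", "C", "D", "E", "F", "G", "H", "I"}) {
            System.out.println("Key " + key + " -> partition " + partitionFor(key, numPartitions));
        }
        // Two different keys whose hashes are congruent mod 10 collide:
        System.out.println("A -> " + partitionFor("A", numPartitions)
                + ", K -> " + partitionFor("K", numPartitions)); // both map to partition 5
    }
}
```

Even with this stand-in hash you can see the effect: the nine single-character keys leave one of the ten partitions empty, and a different key set (here A and K) doubles up on one partition, which is exactly the pattern observed in the question.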

    Now you can see that two different keys can indeed land in the same partition whenever their hashes are congruent modulo the number of partitions.

    For a large set of random keys, the keys will be distributed roughly uniformly across all partitions.
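A quick simulation illustrates this (again using the String.hashCode stand-in rather than murmur2, and an arbitrary seed and key format chosen for the example):

```java
import java.util.Random;

public class DistributionDemo {
    // Stand-in for Kafka's toPositive(murmur2(keyBytes)) % numPartitions.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 10;
        int total = 100_000;
        int[] counts = new int[numPartitions];
        Random rnd = new Random(42); // fixed seed, arbitrary choice

        for (int i = 0; i < total; i++) {
            // Random hex-string keys; any reasonably varied key set behaves similarly.
            String key = Long.toHexString(rnd.nextLong());
            counts[partitionFor(key, numPartitions)]++;
        }

        for (int p = 0; p < numPartitions; p++) {
            System.out.printf("Partition %d: %d messages (%.1f%%)%n",
                    p, counts[p], 100.0 * counts[p] / total);
        }
    }
}
```

With 100,000 random keys, every partition ends up with close to 10% of the messages; the skew and empty partitions in the question are an artifact of having only 9 keys.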

    If you want ordering, then you must use a partition key, in which case your worries about collisions and empty partitions have little practical consequence (for a large set of random keys, the distribution will even out). If you assumed that Kafka would centrally make sure that empty partitions are filled before a key is routed to an already-filled partition, that is not how it works: each partition is chosen independently from the key's hash.
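Since the partition is a deterministic function of the key, per-key ordering follows automatically: every message carrying a given key lands in the same partition, regardless of which producer sends it. A minimal sketch (hypothetical key name, same String.hashCode stand-in as above rather than Kafka's murmur2):

```java
public class OrderingDemo {
    // Stand-in for Kafka's toPositive(murmur2(keyBytes)) % numPartitions;
    // the determinism argument is identical with the real hash.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            // Every send with the same key targets the same partition,
            // so a consumer reads these messages in production order.
            System.out.println("Message " + i + " (key order-42) -> partition "
                    + partitionFor("order-42", 10));
        }
    }
}
```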