
Kafka Partition Does Not Match MurmurHash2 32-bit Algorithm


I'm working on a disaster recovery feature where I need to determine the Kafka partition of a given key in order to replay messages from that partition. I've read that if a key is provided, Kafka will use murmur2(key) % numOfPartitions. However, this doesn't seem to be what happens in practice.

Here's a table with the keys, the result of murmur2(key) % numOfPartitions, and the partition each key is actually written to.

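| key                              | murmur2 % 3 | Actual partition |
|----------------------------------|-------------|------------------|
| AF42CC55DFC84DBC881743CEC2733A22 | 1           | 2                |
| 209BFB14708147319571502816D3D100 | 0           | 0                |
| 5A8DE05847404D1DA856EF8E35AE3830 | 2           | 1                |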

The topic has 3 partitions, and I'm using this online MurmurHash2 32-bit calculator: http://murmurhash.shorelabs.com/

Note the discrepancy in the 1st and 3rd keys - the actual partition does not match the calculated hashed partition.

This article states:

DefaultPartitioner is a Partitioner that uses a 32-bit murmur2 hash to compute the partition for a record (with the key defined) or chooses a partition in a round-robin fashion (per the available partitions of the topic).

Any ideas why the keys AF42CC55DFC84DBC881743CEC2733A22 and 5A8DE05847404D1DA856EF8E35AE3830 are not being stored in the murmur2 hashed partition?


Solution

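  • Kafka hashes the serialized bytes of the key (with StringSerializer, that is the UTF-8 encoding of the string, which is Kafka's default). The online tool you've used seems to be using ASCII. Note also that Kafka's built-in partitioner uses its own murmur2 implementation with a fixed seed and clears the sign bit of the result before taking the modulo, so a generic MurmurHash2 calculator will not necessarily produce the same partition.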

    Alternatively, as part of your backup solution, you can store the partition number directly alongside the data. Then everything stored under that path is accurate, and you avoid having to calculate hashes, look up how many partitions the topic actually has, or depend on the default partitioning behavior, which producers can easily override.
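If you do want to recompute the partition yourself, here is a minimal sketch of the calculation in Python. It assumes the producer used Kafka's built-in partitioner and StringSerializer (UTF-8), and it is a hand port of the Java `Utils.murmur2` and `Utils.toPositive` helpers from the Kafka client. The function names `murmur2`, `to_positive`, and `kafka_partition` are illustrative, not part of any library, and the result will not apply if the producer configured a custom partitioner.

```python
def murmur2(data: bytes) -> int:
    """Port of Kafka's Java Utils.murmur2; returns a signed 32-bit int."""
    seed = 0x9747B28C          # fixed seed used by Kafka's implementation
    m = 0x5BD1E995
    r = 24
    length = len(data)
    h = (seed ^ length) & 0xFFFFFFFF

    # Mix the input four bytes at a time, read as little-endian integers.
    full = length - length % 4
    for i in range(0, full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & 0xFFFFFFFF
        k ^= k >> r
        k = (k * m) & 0xFFFFFFFF
        h = (h * m) & 0xFFFFFFFF
        h ^= k

    # Handle the remaining 1 to 3 bytes (mirrors Java's switch fall-through).
    tail = data[full:]
    if len(tail) == 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * m) & 0xFFFFFFFF

    h ^= h >> 13
    h = (h * m) & 0xFFFFFFFF
    h ^= h >> 15

    # Reinterpret as a signed 32-bit value, as a Java int would be.
    return h - 0x100000000 if h & 0x80000000 else h


def to_positive(n: int) -> int:
    """Clear the sign bit, as Kafka's Utils.toPositive does."""
    return n & 0x7FFFFFFF


def kafka_partition(key: str, num_partitions: int) -> int:
    """Partition for a string key, assuming UTF-8 serialization."""
    return to_positive(murmur2(key.encode("utf-8"))) % num_partitions


if __name__ == "__main__":
    for key in (
        "AF42CC55DFC84DBC881743CEC2733A22",
        "209BFB14708147319571502816D3D100",
        "5A8DE05847404D1DA856EF8E35AE3830",
    ):
        print(key, kafka_partition(key, 3))
```

Two details are easy to miss when comparing against a generic calculator: the seed is fixed inside Kafka's implementation, and the sign bit is masked off rather than the hash being treated as an unsigned number, which changes the remainder whenever the raw hash is negative.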