amazon-kinesis amazon-kcl

How to use ExplicitHashKey for round robin stream assignment in AWS Kinesis


I am trying to pump lots of data through Amazon Kinesis (on the order of 10,000 records per second).

In order to maximize records per second through my shards, I'd like to round robin my requests over the shards (my application logic doesn't care what shard individual messages go to).

It would seem I could do this with the ExplicitHashKey parameter for the messages in the list I am sending to the PutRecords endpoint - however the Amazon documentation doesn't actually describe how to use ExplicitHashKey, other than the oracular statement of:

http://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html

Each record in the Records array may include an optional parameter, ExplicitHashKey, which overrides the partition key to shard mapping. This parameter allows a data producer to determine explicitly the shard where the record is stored. For more information, see Adding Multiple Records with PutRecords in the Amazon Kinesis Streams Developer Guide.

(The statement in the docs above has a link to another section of the documentation, which does not discuss ExplicitHashKeys at all).

Is there a way to use ExplicitHashKey to round robin data among shards?

What are valid values for the parameter?


Solution

  • The hash key space is the set of 128-bit integers from 0 to 2^128 - 1, and each shard is assigned a contiguous range within that space.

    You may find the range of integers assigned to a given shard in a stream via the AWS CLI:

    aws kinesis describe-stream --stream-name name-of-your-stream

    The output will look like:

    {
        "StreamDescription": {
            "RetentionPeriodHours": 24, 
            "StreamStatus": "ACTIVE", 
            "StreamName": "name-of-your-stream", 
            "StreamARN": "arn:aws:kinesis:us-west-2:your-stream-info", 
            "Shards": [
               {
                    "ShardId": "shardId-000000000113", 
                    "HashKeyRange": {
                        "EndingHashKey": "14794885518301672324494548149207313541", 
                        "StartingHashKey": "0"
                    }, 
                    "ParentShardId": "shardId-000000000061", 
                    "SequenceNumberRange": {
                        "StartingSequenceNumber": "49574208032121771421311268772132530603758174814974510866"
                    }
                }, 
               { ... more shards ... }
           ...
    

    You may set the ExplicitHashKey of a record to the string decimal representation of an integer value anywhere in the range of hash keys for a shard to force it to be sent to that particular shard.

    Note that, due to prior merge and split operations on your stream, the output may list many shards with overlapping HashKeyRanges, because closed parent shards are still included. The currently open shards are the ones that do not have a SequenceNumberRange.EndingSequenceNumber element.
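The open-shard rule above can be sketched in Python as a filter over the parsed describe-stream output. This is a minimal sketch: the function name `open_shards` is hypothetical, and it assumes the JSON has already been loaded into a dict with the same layout as the CLI output shown earlier.

```python
def open_shards(describe_stream_output):
    """Return the shards that are still open.

    A shard is treated as open when its SequenceNumberRange has no
    EndingSequenceNumber element, as described above.
    """
    shards = describe_stream_output["StreamDescription"]["Shards"]
    return [
        shard
        for shard in shards
        if "EndingSequenceNumber" not in shard["SequenceNumberRange"]
    ]
```

For a stream that has been resharded, the closed parent shards are dropped and only the shards that can still accept records remain.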

    You can round robin requests among a set of shards by picking a 128-bit integer within the hash key range of each shard of interest, and then cycling through the decimal string representations of those integers as you set each record's ExplicitHashKey.
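A minimal Python sketch of that round-robin assignment follows. The function names and the placeholder partition key are hypothetical, and `shards` is assumed to be a list of currently open shard entries in the same layout as the describe-stream output. It uses each shard's StartingHashKey as the representative value, since that is always inside the shard's range.

```python
import itertools


def explicit_hash_keys(shards):
    """Pick one hash key (a decimal string) inside each shard's range."""
    return [shard["HashKeyRange"]["StartingHashKey"] for shard in shards]


def build_round_robin_records(payloads, shards, partition_key="unused"):
    """Build PutRecords entries whose ExplicitHashKey cycles over the shards.

    PartitionKey is still a required field of each entry, so a placeholder
    is supplied; ExplicitHashKey overrides it for shard selection.
    """
    keys = itertools.cycle(explicit_hash_keys(shards))
    return [
        {
            "Data": payload,
            "PartitionKey": partition_key,
            "ExplicitHashKey": next(keys),
        }
        for payload in payloads
    ]
```

The resulting list has the shape expected for the Records parameter of PutRecords; with boto3, for example, it could be passed as `client.put_records(StreamName=..., Records=records)`, assuming a configured Kinesis client.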

    As a side note, you can also calculate the hash value a given PartitionKey will evaluate to by:

    1. Compute the MD5 sum of the partition key.
    2. Interpret the MD5 sum as a hexadecimal number and convert it to base 10. This will be the hash key for that partition key. You can then look up which shard's HashKeyRange that hash key falls into.
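The two steps above can be sketched in Python as follows. The function names are hypothetical, the partition key is assumed to be hashed as its UTF-8 bytes, and `shards` is assumed to be a list of open shard entries in the same layout as the describe-stream output.

```python
import hashlib


def partition_key_hash(partition_key):
    """Return the 128-bit hash key for a partition key as an int.

    Step 1: MD5 of the key (assumed UTF-8 encoded).
    Step 2: read the hex digest as a base-16 integer.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16)


def shard_for_hash_key(hash_key, shards):
    """Return the ShardId whose HashKeyRange contains hash_key, or None."""
    for shard in shards:
        key_range = shard["HashKeyRange"]
        start = int(key_range["StartingHashKey"])
        end = int(key_range["EndingHashKey"])
        if start <= hash_key <= end:  # both bounds are inclusive
            return shard["ShardId"]
    return None
```

Calling `str(partition_key_hash(key))` gives the decimal string form, which is the same form that ExplicitHashKey expects.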