
DynamoDB Design PartitionKey, RangeKey and GSI


I'm designing a new table in DynamoDB. I have read some documentation, but I can't figure out which schema design I should follow to avoid problems in the future.

Current Approach

Table - events

 - eventId (HashKey)
 - userId
 - createdAt
 - some other attributes...

Table - users

 - userId (HashKey)
 - name
 - birth
 - address

The events table is going to have a lot of entries, on the order of millions. The users table will have about 20 entries at the moment.

I will need to perform the following queries:

 - GET paginated events from specific userId ordered by createdAt
 - GET paginated events from specific userId between some range of dates and ordered by createdAt 
 - GET specific event entry by eventId

So I thought to create a GSI (Global Secondary Index) on events table with the following setup:

 - userId (HashKey)
 - createdAt (RangeKey)
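
For reference, this is roughly how I would declare that setup. It is only a sketch: the index name `userId-createdAt-index` is made up, I'm assuming all three key attributes are strings (with createdAt stored as an ISO-8601 timestamp), and on-demand billing is just a placeholder choice. The function only builds the parameters for the CreateTable operation.

```python
def events_table_definition(table_name="events",
                            index_name="userId-createdAt-index"):
    """Build keyword arguments for DynamoDB's CreateTable operation.

    Assumptions (not from any official schema): all key attributes are
    strings, and the index name is hypothetical.
    """
    return {
        "TableName": table_name,
        # Only attributes that appear in a key schema are declared here.
        "AttributeDefinitions": [
            {"AttributeName": "eventId", "AttributeType": "S"},
            {"AttributeName": "userId", "AttributeType": "S"},
            {"AttributeName": "createdAt", "AttributeType": "S"},
        ],
        # Base table: eventId is the hash (partition) key, no range key.
        "KeySchema": [{"AttributeName": "eventId", "KeyType": "HASH"}],
        # GSI: userId as hash key, createdAt as range (sort) key.
        "GlobalSecondaryIndexes": [
            {
                "IndexName": index_name,
                "KeySchema": [
                    {"AttributeName": "userId", "KeyType": "HASH"},
                    {"AttributeName": "createdAt", "KeyType": "RANGE"},
                ],
                # Copy every attribute into the index (assumed for simplicity).
                "Projection": {"ProjectionType": "ALL"},
            }
        ],
        # On-demand capacity, so no ProvisionedThroughput blocks are needed.
        "BillingMode": "PAY_PER_REQUEST",
    }
```

With boto3 this would be passed along as `boto3.client("dynamodb").create_table(**events_table_definition())`.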

But my question here is: does my initial design make sense? Somehow I feel that I could design the events table with the following setup instead:

 - userId (HashKey)
 - eventId (SortKey)

But I think that following this approach I would run into the Hot Partition Pitfall.

Any advice or recommendations would be appreciated.

Thanks.


Solution

  • Your approach seems quite good to me. Keep in mind the best practices for partition key design (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html), specifically:

    Generally speaking, you should design your application for uniform activity across all logical partition keys in the Table and its secondary indexes. You can determine the access patterns that your application requires, and estimate the total RCUs and WCUs that each table and secondary Index requires.

    In other words, data mutations (writes) should be distributed as evenly as possible across all partitions. In your case, there will be a lot of events and a limited number of users, which suggests that each user will have a very large number of events.

    If you partition the table on eventId, you will end up with millions of distinct partition key values, even though they belong to only about 20 users. Lookups by eventId will be distributed evenly across the partitions, and so will the writes for each event. Queries for a user's events would then go through the GSI you proposed (userId as hash key, createdAt as range key).
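
    As a rough sketch of the reads involved, the three access patterns from the question could be expressed as request parameters like the ones below. This assumes the GSI from the question exists under the made-up name `userId-createdAt-index`, that all key attributes are strings, and that createdAt holds ISO-8601 timestamps in a single format (so string order matches time order). The helper names are hypothetical; the functions only build parameters for the Query and GetItem operations.

```python
INDEX_NAME = "userId-createdAt-index"  # hypothetical GSI name


def events_by_user(user_id, start=None, end=None, page_size=25,
                   start_key=None, newest_first=False):
    """Build Query parameters for one page of a user's events.

    Results from the GSI come back ordered by createdAt (its range key).
    Pass the previous response's LastEvaluatedKey as start_key to fetch
    the next page.
    """
    condition = "userId = :uid"
    values = {":uid": {"S": user_id}}
    if start is not None and end is not None:
        # BETWEEN is inclusive on both ends.
        condition += " AND createdAt BETWEEN :start AND :end"
        values[":start"] = {"S": start}
        values[":end"] = {"S": end}
    params = {
        "TableName": "events",
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": condition,
        "ExpressionAttributeValues": values,
        "ScanIndexForward": not newest_first,  # True means ascending createdAt
        "Limit": page_size,
    }
    if start_key is not None:
        params["ExclusiveStartKey"] = start_key
    return params


def event_by_id(event_id):
    """Build GetItem parameters for a single event on the base table."""
    return {"TableName": "events", "Key": {"eventId": {"S": event_id}}}
```

    With boto3, these would be passed as `client.query(**events_by_user("user-1"))` and `client.get_item(**event_by_id("some-event-id"))`, where `client = boto3.client("dynamodb")`.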

    However, if you choose userId as the partition key, more of the requests will end up on the same partition than in the other case. Hence, I would suggest going with the former (eventId as the partition key).
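
    To make the difference concrete, here is a toy simulation, not a model of DynamoDB itself. It assumes 16 partitions, uses MD5 as a stand-in for DynamoDB's internal hash function, and assumes a skewed workload in which one of the 20 users creates half of all events. All of these numbers are invented for illustration.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 16   # illustrative; DynamoDB manages the real partition count
NUM_EVENTS = 10_000
NUM_USERS = 20


def partition_for(key):
    """Map a partition key value to a bucket (MD5 is only a stand-in hash)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


def user_for(i):
    """Assumed skew: user-0 creates every second event, the rest share the others."""
    if i % 2 == 0:
        return "user-0"
    return f"user-{1 + (i // 2) % (NUM_USERS - 1)}"


def busiest_partition_share(keys):
    """Fraction of all writes that land on the most loaded partition."""
    load = Counter(partition_for(k) for k in keys)
    return max(load.values()) / len(keys)


events = [(f"event-{i}", user_for(i)) for i in range(NUM_EVENTS)]
share_by_event_id = busiest_partition_share([event_id for event_id, _ in events])
share_by_user_id = busiest_partition_share([user_id for _, user_id in events])

print(f"eventId as partition key: busiest partition gets {share_by_event_id:.1%} of writes")
print(f"userId as partition key:  busiest partition gets {share_by_user_id:.1%} of writes")
```

    Under these assumptions, keying on userId sends at least half of the writes to whichever partition holds the busy user, while keying on eventId spreads them roughly evenly.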

    That's my 2 cents.