Search code examples
amazon-web-servicesdateamazon-dynamodbdynamodb-queriesamazon-dynamodb-index

DynamoDB date GSI


I have a DynamoDB table that stores executions of some programs, this is what it looks like:

Partition Key Sort Key StartDate ...
program-name execution-id (uuid) YYYY-MM-DD HH:mm:ss ...

I have two query scenarios for this table:

  1. Query by program name and execution id (easy)
  2. Query by start date range, for example: all executions from 2021-05-15 00:00:00 to 2021-07-15 23:59:59

What is the correct way to perform the second query?

I understand I need to create a GSI to do that, but how should this GSI look like?

I was thinking about splitting the StartDate attribute into two, like this:

Partition Key Sort Key StartMonthYear StartDayTime ...
program-name execution-id (uuid) YYYY-MM DD HH:mm:ss ...

So I can define a GSI using the StartMonthYear as the partition key and the StartDayTime as the sort key.

The only problem with this approach is that I would have to write some extra logic in my application to identify all the partitions I would need to query in the requested range. For example:

If the range is: 2021-05-15 00:00:00 to 2021-07-15 23:59:59

I would need to query 2021-05, 2021-06 and 2021-07 partitions with the respective day/time restrictions (only the first and last partition is this example).

Is this the correct way of doing this or am I totally wrong?


Solution

  • If you quickly want to fetch all executions in a certain time-frame no matter the program, there are a few ways to approach this. The easiest solution would be a setup like this:

    PK SK GSI1PK GSI1SK StartDate
    PROG#<name> EXEC#<uuid> ALL_EXECUTIONS S#<yyyy-mm-ddThh:mm:ss>#EXEC<uuid> yyyy-mm-ddThh:mm:ss
    • PK is the partition key for the base table
    • SK is the sort key for the base table
    • GSI1PK is the partition key for the global secondary index GSI1
    • GSI1SK is the sort key for the global secondary index GSI1
    1. Query by program name and execution id (easy)

    Still easy, do a GetItem based on the program name for <name> and uuid for <uuid>.

    1. Query by start date range, for example: all executions from 2021-05-15 00:00:00 to 2021-07-15 23:59:59

    Do a Query on GSI1 with the KeyConditionExpression: PK = ALL_EXECUTIONS AND SK >= 'S#2021-05-15 00:00:00' AND SK <= 'S#2021-07-15 23:59:59'. This would return all the executions in the given time range.

    But: You'll also build a hot partition, since you effectively write all your data in a single partition in GSI1.

    To avoid that, we can partition the data a bit and the partitioning depends on the number of executions you're dealing with. You can choose years, months, days, hours, minutes or seconds.

    Instead of GSI1PK just being ALL_EXECUTIONS, we can set it to a subset of the StartDate.

    PK SK GSI1PK GSI1SK StartDate
    PROG#<name> EXEC#<uuid> EXCTS#<yyyy-mm> S#<yyyy-mm-ddThh:mm:ss>#EXEC<uuid> yyyy-mm-ddThh:mm:ss

    In this case you'd have a monthly partition, i.e.: all executions per month are grouped. Now you would have to make multiple queries to DynamoDB and later join the results.

    For the query range from 2021-05-15 00:00:00 to 2021-07-15 23:59:59 you'd have to do these queries on GSI1:

    • @GSI1: GSI1PK=EXCTS#2021-05 AND GSI1SK >= S#2021-05-15 00:00:00
    • @GSI1: GSI1PK=EXCTS#2021-06
    • @GSI1: GSI1PK=EXCTS#2021-07 AND GSI1SK <= S#2021-07-15 23:59:59

    You can even parallelize these and later join the results together.

    Again: Your partitioning scheme depends on the number of executions you have in a day and also which maximum query ranges you want to support.

    This is a long-winded way of saying that your approach is correct in principle, but you can choose to tune it based on your use case.