amazon-web-services date amazon-dynamodb dynamodb-queries amazon-dynamodb-index

DynamoDB date GSI

I have a DynamoDB table that stores executions of some programs, this is what it looks like:

Partition Key	Sort Key	StartDate	...
program-name	execution-id (uuid)	YYYY-MM-DD HH:mm:ss	...

I have two query scenarios for this table:

Query by program name and execution id (easy)
Query by start date range, for example: all executions from 2021-05-15 00:00:00 to 2021-07-15 23:59:59

What is the correct way to perform the second query?

I understand I need to create a GSI to do that, but how should this GSI look like?

I was thinking about splitting the StartDate attribute into two, like this:

Partition Key	Sort Key	StartMonthYear	StartDayTime	...
program-name	execution-id (uuid)	YYYY-MM	DD HH:mm:ss	...

So I can define a GSI using the StartMonthYear as the partition key and the StartDayTime as the sort key.

The only problem with this approach is that I would have to write some extra logic in my application to identify all the partitions I would need to query in the requested range. For example:

If the range is: 2021-05-15 00:00:00 to 2021-07-15 23:59:59

I would need to query 2021-05, 2021-06 and 2021-07 partitions with the respective day/time restrictions (only the first and last partition is this example).

Is this the correct way of doing this or am I totally wrong?

Solution

If you quickly want to fetch all executions in a certain time-frame no matter the program, there are a few ways to approach this. The easiest solution would be a setup like this:

PK	SK	GSI1PK	GSI1SK	StartDate
`PROG#<name>`	`EXEC#<uuid>`	`ALL_EXECUTIONS`	`S#<yyyy-mm-ddThh:mm:ss>#EXEC<uuid>`	`yyyy-mm-ddThh:mm:ss`

PK is the partition key for the base table
SK is the sort key for the base table
GSI1PK is the partition key for the global secondary index GSI1
GSI1SK is the sort key for the global secondary index GSI1

Query by program name and execution id (easy)

Still easy, do a GetItem based on the program name for <name> and uuid for <uuid>.

Query by start date range, for example: all executions from 2021-05-15 00:00:00 to 2021-07-15 23:59:59

Do a Query on GSI1 with the KeyConditionExpression: PK = ALL_EXECUTIONS AND SK >= 'S#2021-05-15 00:00:00' AND SK <= 'S#2021-07-15 23:59:59'. This would return all the executions in the given time range.

But: You'll also build a hot partition, since you effectively write all your data in a single partition in GSI1.

To avoid that, we can partition the data a bit and the partitioning depends on the number of executions you're dealing with. You can choose years, months, days, hours, minutes or seconds.

Instead of GSI1PK just being ALL_EXECUTIONS, we can set it to a subset of the StartDate.

PK	SK	GSI1PK	GSI1SK	StartDate
`PROG#<name>`	`EXEC#<uuid>`	`EXCTS#<yyyy-mm>`	`S#<yyyy-mm-ddThh:mm:ss>#EXEC<uuid>`	`yyyy-mm-ddThh:mm:ss`

In this case you'd have a monthly partition, i.e.: all executions per month are grouped. Now you would have to make multiple queries to DynamoDB and later join the results.

For the query range from 2021-05-15 00:00:00 to 2021-07-15 23:59:59 you'd have to do these queries on GSI1:

@GSI1: GSI1PK=EXCTS#2021-05 AND GSI1SK >= S#2021-05-15 00:00:00
@GSI1: GSI1PK=EXCTS#2021-06
@GSI1: GSI1PK=EXCTS#2021-07 AND GSI1SK <= S#2021-07-15 23:59:59

You can even parallelize these and later join the results together.

Again: Your partitioning scheme depends on the number of executions you have in a day and also which maximum query ranges you want to support.

This is a long-winded way of saying that your approach is correct in principle, but you can choose to tune it based on your use case.