I have a DynamoDB table that stores executions of some programs, this is what it looks like:
Partition Key | Sort Key | StartDate | ... |
---|---|---|---|
program-name | execution-id (uuid) | YYYY-MM-DD HH:mm:ss | ... |
I have two query scenarios for this table:
2021-05-15 00:00:00
to 2021-07-15 23:59:59
What is the correct way to perform the second query?
I understand I need to create a GSI to do that, but how should this GSI look like?
I was thinking about splitting the StartDate
attribute into two, like this:
Partition Key | Sort Key | StartMonthYear | StartDayTime | ... |
---|---|---|---|---|
program-name | execution-id (uuid) | YYYY-MM | DD HH:mm:ss | ... |
So I can define a GSI using the StartMonthYear
as the partition key and the StartDayTime
as the sort key.
The only problem with this approach is that I would have to write some extra logic in my application to identify all the partitions I would need to query in the requested range. For example:
If the range is: 2021-05-15 00:00:00
to 2021-07-15 23:59:59
I would need to query 2021-05
, 2021-06
and 2021-07
partitions with the respective day/time restrictions (only the first and last partition is this example).
Is this the correct way of doing this or am I totally wrong?
If you quickly want to fetch all executions in a certain time-frame no matter the program, there are a few ways to approach this. The easiest solution would be a setup like this:
PK | SK | GSI1PK | GSI1SK | StartDate |
---|---|---|---|---|
PROG#<name> |
EXEC#<uuid> |
ALL_EXECUTIONS |
S#<yyyy-mm-ddThh:mm:ss>#EXEC<uuid> |
yyyy-mm-ddThh:mm:ss |
- Query by program name and execution id (easy)
Still easy, do a GetItem
based on the program name for <name>
and uuid for <uuid>
.
- Query by start date range, for example: all executions from
2021-05-15 00:00:00
to2021-07-15 23:59:59
Do a Query on GSI1 with the KeyConditionExpression: PK = ALL_EXECUTIONS AND SK >= 'S#2021-05-15 00:00:00' AND SK <= 'S#2021-07-15 23:59:59'
. This would return all the executions in the given time range.
But: You'll also build a hot partition, since you effectively write all your data in a single partition in GSI1.
To avoid that, we can partition the data a bit and the partitioning depends on the number of executions you're dealing with. You can choose years, months, days, hours, minutes or seconds.
Instead of GSI1PK just being ALL_EXECUTIONS, we can set it to a subset of the StartDate.
PK | SK | GSI1PK | GSI1SK | StartDate |
---|---|---|---|---|
PROG#<name> |
EXEC#<uuid> |
EXCTS#<yyyy-mm> |
S#<yyyy-mm-ddThh:mm:ss>#EXEC<uuid> |
yyyy-mm-ddThh:mm:ss |
In this case you'd have a monthly partition, i.e.: all executions per month are grouped. Now you would have to make multiple queries to DynamoDB and later join the results.
For the query range from 2021-05-15 00:00:00
to 2021-07-15 23:59:59
you'd have to do these queries on GSI1:
GSI1PK=EXCTS#2021-05 AND GSI1SK >= S#2021-05-15 00:00:00
GSI1PK=EXCTS#2021-06
GSI1PK=EXCTS#2021-07 AND GSI1SK <= S#2021-07-15 23:59:59
You can even parallelize these and later join the results together.
Again: Your partitioning scheme depends on the number of executions you have in a day and also which maximum query ranges you want to support.
This is a long-winded way of saying that your approach is correct in principle, but you can choose to tune it based on your use case.