I'm building an app using DynamoDB as the database. My app has a concept of users, organizations, projects and scripts.
Use cases the app needs to query:
I was thinking of creating a single table Entities
:
pk | sk | type (gsi1-pk) | parent (gsi1-sk) |
---|---|---|---|
root#org-1 | info | org | root |
org-1#proj-1 | info | project | org-1 |
proj-1#script-1 | info | script | proj-1 |
proj-1#script-1 | rev_latest | script_revision | script-1 |
proj-1#script-1 | rev_1 | script_revision | script-1 |
proj-1#script-1 | rev_2 | script_revision | script-1 |
org-1#user-1 | info | user | org-1 |
org-1#user-2 | info | user | org-1 |
PK and SK would allow me to get the data of specific items providing the ID. For the queries where I have the id of a parent object and need to fetch the children, I would create a GSI and query against that.
For PK, I could drop the first token before # which is supposed to represent the parent object which is redundant since it's defined in the parent column but I left it there to prevent accidental cross-object updates.
Am I better off with multiple DynamoDB tables to store the info?
I would prefer not using scan queries for economic reasons. Ideally I can get all the data via query and scan operation.
Here's one single-table option. The design satisfies your access patterns using query operations. No scan operations are required.
pk | sk | gsi1pk | gsi1sk |
---|---|---|---|
org#org-1 | xInfo | ||
proj#proj-1 | xInfo | org#org-1 | proj#proj-1#xInfo |
script#script-1 | xInfo | proj#proj-1 | script#script-1#xInfo |
script#script-1 | rev#2023-10-05T22:00:00 | proj#proj-1 | script#script-1#rev#2023-10-05T22:00:00 |
script#script-1 | rev#2023-03-13T10:00:00 | proj#proj-1 | script#script-1#rev#2023-03-13T10:00:00 |
script#script-1 | rev#2022-12-20T08:00:00 | proj#proj-1 | script#script-1#rev#2022-12-20T08:00:00 |
user#user-1 | xInfo | org#org-1 | user#user-1#xInfo |
user#user-2 | xInfo | org#org-1 | user#user-1#xInfo |
The table primary key models the records. The partition key pk
has the format <entity-type>#<entity-id>
. I find the entity-type
prefix helpful to visually identify a record's entity type, but it is optional here. The sort key sk
defines the record type. Scripts have two record types, info and revisions. Giving script revisions a timestamp "id" makes it easy to retrieve all revisions or only the latest revision1. I renamed info
to xInfo
to move the "info" record after all revisions in lexical ordering2. Why do this? So we can query a script by pk
in descending order (ScanIndexForward: false
) with a limit of 2 to get a script's info record and latest revision in a single operation. No limit returns all revisions and the info record.
Use the table primary key to answer all the "by id" queries. We need an index to handle the "belongs to" queries.
The secondary index partition key (gsi1pk
) models the "belongs to" relation. Organizations appear not to belong to anything, so the index is blank. The index sort key gsi1sk
takes the form <entity-type>#<entity-id>#<record-type>
. The record-type
suffix isn't needed except for scripts, but seems like a good idea to apply everywhere for consistency and future-proofing.
"Get all users belonging to a organization" is answered with a query of gsi1pk = org#org-1
and gsi1sk > "user#"
. The same pattern satisfies the other "belongs to" queries.
A note on your closing question "Am I better off with multiple dynamodb tables...?": Including multiple tables in your design doesn't give you more modelling options. You are still working with partition and sort keys. Multiple tables may be called for in certain cases, but it doesn't really help with modelling as such.
The timestamp component also gives you the option to query for a subset of revisions by time period (e.g. current year's revisions).
You could use ~info
or z
instead of xInfo
, as long as the prefix character comes after rev
in UTF-8 sort order. We do this to engineer a useful script sort key (descending) order: info record, latest revision, latest-1 revision, etc.