This is a simplified version of my problem using a DynamoDB Table. Most items in the Table represent sales across multiple countries. One of my required access patterns is to retrieve all sales in countries which belong to a certain country_grouping
between a range of order_date
s. The incoming stream of sales data contains the country
attribute, but not the country_grouping
attribute.
Another entity in the same Table is a reference table, which is infrequently updated, which could be used to identify the country_grouping
for each country
. Can I design a GSI or otherwise structure the table to retrieve all sales for a given country_grouping
between a range of order dates?
Here's an example of the Table structure:
PK | SK | sale_id | order_date | country | country_grouping |
---|---|---|---|---|---|
SALE#ID#1 | ORDER_DATE#2022-06-01 | 1 | 2022-06-01 | UK | |
SALE#ID#2 | ORDER_DATE#2022-09-01 | 2 | 2022-09-01 | France | |
SALE#ID#3 | ORDER_DATE#2022-07-01 | 3 | 2022-07-01 | Switzerland | |
COUNTRY_GROUPING#EU | COUNTRY#France | France | EU | ||
COUNTRY_GROUPING#NATO | COUNTRY#UK | UK | NATO | ||
COUNTRY_GROUPING#NATO | COUNTRY#France | France | NATO |
Possible solution 1
As the sales items are streamed into the Table, query the country_grouping
associated with the country
in the sale, and write the corresponding country_grouping
to each sale record. Then create a GSI where country_grouping
is the partition key and the order_date
is the sort key. This seems expensive to me, consuming 1 RCU and 1 WCU per sale record imported. If country groupings changed (imagine the UK rejoins the EU), then I would run an update operation against all sales in the UK.
Possible solution 2
Have the application first query to retrieve every country
in the desired country_grouping
, then send an individual request for each country using a GSI where the partition key is country
and the order_date
is the sort key. Again, this seems less than ideal, as I consume 1 WCU per country, plus the 1 WCU to obtain the list of countries.
Is there a better way?
Picking an optimal solution depends on factors you haven't mentioned:
For example, if your Sale items are large and/or you insert a lot every second, you're going to need to worry about creating a hot key in the GSI. I'm going to assume your update rate is not too high, the Sale item size isn't too large, and you're going to have thousands or more Sale items per country.
If my assumptions are correct, then I'd go with Solution 2. You'll spend one read unit (it's not a WCU but rather an RCU, and it's only half a read unit if eventually consistent) to Query the country group and get a list of countries. Do one Query for each country in that group to pull all the Sale items matching the specific time range for that country. Since there are lots of matching sales, the cost is about the same. One 400 KB pull from a country_grouping PK is the same cost as 4 100 KB pulls from four different country PKs. You can also do the country Query calls in parallel, if you want, to speed execution. If you're returning megabytes of data or more, this will be helpful.
If in fact you have only a few sales per country, well, then any design will work.