Search code examples
amazon-web-servicesamazon-dynamodbdynamodb-queries

How to limit dynamodb scan to a given partition key and NOT read the entire table


Theoretical table with billions of entries.

Partition key is a unique uuid representing a given deviceId. There will be around 10k unique uuids.

Sort Key is a dateString for when the data was collected.

Each item has some data fields. There are dozens of fields such that making a GSI for each wouldn't be reasonable. For our example, let's say we are looking for the "dataOfInterest" field.

I'd like to search the DB for "all items where the dataOfInterest = 'foobar'" - and ideally do it within a date range. As far as I know, a scan operation is the only option. With billions entries... that's not going to be a fast process (though I understand I could split it out to run multiple operations at a time - it's stil going to eat RCU's like crazy)

Of note, I only care about a given uuid for each search, however. In other words, what I REALLY care about is "all items within a given partition where the dataOfInterest = 'foobar'". And futher, it'd be great to use the sort key to give "all items within a given partition where the dataOfInterest = 'foobar' that are between Jan 1 and Feb 28"

The scan operation allows you to limit the results with a filter expression such that I could get the results of just a single partition ... but it still reads the entire table and the filtering is done before returning the data to you. https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html

Is there an AWS API that does a scan-like operation that reads only a given partition? Are there other ways to achieve this (perhaps re-architecting the DB?)


Solution

  • As @jarmod says, you can use a Query and specify the PK of the UUID. You can then either put the timestamp into the SK and filter for the dataOfInterest value (unindexed), or for more efficiency and to make everything indexed you can construct a composite SK which is dataOfInterest#timestamp and then do a range query on the SK of foobar#time1 to foobar#time2. That makes this query perfectly index optimized.

    Course, this makes purely timestamp-based queries less simple. So you either do multiple queries for those or, if you want both queries efficient, setup this composite SK in a GSI and use that to resolve this query.