Search code examples
amazon-dynamodb

Indexing by sort key in DynamoDB?


I have a DynamoDB table where I'm aggregating CDN access logs. Specifically I want to track:

For a given customer (all of whose requests can be identified from the URL being downloaded), how many bytes were delivered on their behalf each day?

I have a primary partition key on customer and a primary sort key on time_bucket (day). This way given a customer I can say "find all records from March 1st, 2021 to March 31st, 2021" for instance. So far, so good

The issue arose when I wanted to start deleting old data. Anything older than 5 years should be dropped from the database.

Because the partition key isn't on time_bucket, there's no easy way to say "retrieve all records for May 25th, 2016". Doing so requires a scan instead of a query, and scans are out of the question (unusably slow given how much data I'm handling)

I don't want to swap the partition key and sort key for two reasons:

  • When processing new data to add to the Dynamo table, all new CDN logs will be for the same day. This means that my table will be unbalanced: every write operation made during a single day will hit the same partition key
  • If I wanted to pull a month's worth of data for a single customer I would have to make 30 queries -- one for each day of the month. This gets even worse when pulling a year of data, or 3 years of data

My first thought was "just add an index on the time_bucket column", but when I tried this I got an error:

Attribute Name is duplicated: time_bucket (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: PAN9FVSEMBBJT412NCV013VURNVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)

It seems like DynamoDB does not allow you to create an index on the sort key. So what's the proper solution here?


Solution

  • The right way to handle this is to simply set a 5yr TTL on the records when you put them in DDB.

    Not only will the records be removed automatically, but the removal is free. No WCU is consumed (*see fine print).

    You could add TTL now, but you're going to have to put together a little utility to add a expiration time attribute to the existing records.

    If you want to do it manually, you'll need add Global Secondary Index (GSI). You could do so with the existing timebucket as the GSI hash key. Then you'd Query(GSI, hk='2016-05-01') to find the records and DeleteItem() for each one.

    Note that a GSI has it's own costs, and you'll pay to read the GSI and delete from the table.

    fine print
    As of 2024, while TTL deletes remain free for a standard table and for the initial delete in a Global table, the replication of the delete to the other regions of a global table have a cost.