Tags: database-design, terraform, amazon-dynamodb, nosql, database-migration

How to Minimise Data Loss When Migrating a High-Traffic DynamoDB Table In Order to Redefine Its Schema (Introducing a Sort Key)?


Preface: I am aware that the direct answer to "Can A Sort Key be Retroactively Added to An Existing DynamoDB Table?" is a no, as indicated by Add range key to DynamoDB table without destroying the table and How to add primary sort key to an already existing table in AWS dynamo db?

As will become evident, the intention of this post is to explore the actions and considerations that go beyond the immediate answer to that question.


Context

I have a legacy DynamoDB table that sits within a data pipeline; millions of records have flowed through it and been stored in the table over the course of its existence.

This DynamoDB table primarily acts as a KV store, with its simplified schema represented as:

{
 id: "bfk1r3sdvx5937df",
 record_data: {
   key_0: value_0,
   key_1: value_1,
   key_2: value_2,
   ...
 },
}

The table's partition key, id, is a unique hash created when individual records are ingested and saved into DynamoDB, and id persists within these records as they flow through the remaining stages of the data pipeline beyond the table.

Additionally, the record_data nested KV object, represented in the above schema, is not always unique. In fact, there are multiple instances within this table where a particular set of records have record_data values that are exact (1:1) matches of one another.

Given that the two above-mentioned components make up the entire table schema (simplified for this context), the aforementioned id partition key is the only reference with which to query the table.

Now, the limitation of only being able to query the table via its id partition key is that such queries are only useful when a specific id value is already known for a particular record.

For any other purpose, there are no sortable or otherwise discernible patterns within the current schema that allow for more granular, intuitive queries.
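
For illustration, this is effectively the only access pattern available today: a direct key lookup against a known id. A minimal sketch using boto3 (the table name and id value below are hypothetical):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("legacy-records")  # hypothetical table name

# The only viable query today: fetch one record by its known partition key.
response = table.get_item(Key={"id": "bfk1r3sdvx5937df"})
item = response.get("Item")  # None if no record exists for that id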

With all this said, I'm thinking of ways to add an appropriate sort key to this table so that querying its millions of records is more feasible.

Elsewhere in the data pipeline, notably where the same records are also stored, those records carry date timestamps.

Therefore, it would make a lot of sense to add a UTC timestamp (e.g. createdAt) to future records in this DynamoDB table as its new sort key, thereby improving querying capabilities and data consistency as the table scales.
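
A sketch of what a newly written record could look like under that intended schema, assuming an ISO 8601 UTC string as the sort key (the table name is hypothetical, and createdAt is the proposed attribute name):

from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("records-v2")  # hypothetical name of the redefined table

# ISO 8601 UTC timestamps sort lexicographically, so they work as a string sort key.
table.put_item(
    Item={
        "id": "bfk1r3sdvx5937df",
        "createdAt": datetime.now(timezone.utc).isoformat(),
        "record_data": {"key_0": "value_0", "key_1": "value_1"},
    }
)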

Blockers

However, there are several technical blockers with the above approach:

One, DynamoDB prohibits retroactive redefinition of the table schema beyond the table's initial provisioning:

You need to add the sort key at the initial table creation state as you can not add a sort key for an already existing table.

Two, DynamoDB prohibits the presence of empty values that correspond to sort key attributes, thereby creating the need to backfill existing records with non-empty values:

Partition key and sort key attributes of base tables continue to require non-empty values for all data types, including String and Binary.

Similarly, String and Binary key attributes of local secondary indexes or global secondary indexes also require non-empty values.  

Solution

With both considered, the only viable way (outside of abandoning DynamoDB as the datastore for these particular records) would be to:

  1. Create a new table with a redefined schema; the schema would be identical to that of the old table, except for the addition of the new sort_key

  2. Migrate the millions of stored records of the old table into the new table, backfilling the empty sort_key values of old records during their migration
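
A minimal sketch of those two steps, assuming boto3, hypothetical table names, and a placeholder timestamp for backfilling old records; a production migration would additionally need to handle throttling, parallel scan segments, and writes that arrive while the copy is running:

import boto3

dynamodb = boto3.resource("dynamodb")
old_table = dynamodb.Table("legacy-records")  # hypothetical names
new_table = dynamodb.Table("records-v2")

# Placeholder sort key value for old records that carry no real timestamp.
BACKFILL_TIMESTAMP = "1970-01-01T00:00:00+00:00"

scan_kwargs = {}
while True:
    page = old_table.scan(**scan_kwargs)
    # batch_writer buffers puts into batches and retries unprocessed items.
    with new_table.batch_writer() as batch:
        for item in page["Items"]:
            item.setdefault("createdAt", BACKFILL_TIMESTAMP)
            batch.put_item(Item=item)
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]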

TLDR

That being said, my question is: as the old table is currently live in production, with new data written to it every second, what would be the safest way to migrate this existing DynamoDB table to a new table while minimising the risk of data loss?


Further Considerations

  1. Data saved in this table is initially collected and streamed via various client-side applications (earlier in the data pipeline) that are internationalised and collect data across various timezones.

    • Given the internationalised nature of data traffic, it would be challenging to define the table's 'low traffic period': the time period in which it would otherwise make the most sense to perform the migration.
  2. It would be ideal to preserve (and not leave behind) the millions of records saved in the old table, given the continued value of these records.

  3. When the new table is created, the old table needs to be retired so as to preserve the single source of truth and avoid data duplication.

  4. It would also be ideal to explore approaches that will not involve provisioning additional services to the current architecture where this DynamoDB table sits.

  5. The new DynamoDB table will be deployed via Terraform, as was the old table.

  6. No further restrictions.


Answer

  • You're making this leap of faith that adding a createdAt sort key would be beneficial to you. What would your partition key be? You don't say, and in fact there's no good answer. I'm guessing you're planning for it to be id still but if so then you haven't improved your situation. You can't directly retrieve "last 10 items" if each item has a different partition key.

    What you should do is keep this table and add a GSI. It'll auto-backfill, simplifying all your transition questions. In that GSI you can have the sort key be a timestamp. For the partition key, have it be some sharded value. Pick as many shards as you think you need for write capacity. Each shard can accept 1,000 write units per second, so if you need to handle at most 10,000 items per second then have at least 10 shards. Name the attributes GSI1PK and GSI1SK. Into GSI1PK put a value "1" through "10". Into GSI1SK put the timestamp. When you want to run a time-oriented query you'll have to make as many queries as you have shards and merge the results together (a sketch of both the sharded write and the fan-out query follows at the end of this answer).

    Thus you can retrieve by ID from the base table and retrieve by timestamp from the GSI and handle arbitrary load rates.
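
    A minimal sketch of both halves of that pattern, assuming 10 shards, the GSI1PK/GSI1SK attribute names above, and a hypothetical table and index name:

    import random
    from datetime import datetime, timezone

    import boto3
    from boto3.dynamodb.conditions import Key

    NUM_SHARDS = 10  # sized for the expected peak write rate

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("legacy-records")  # hypothetical table name

    # Write path: assign each item a random shard and a sortable UTC timestamp.
    table.put_item(
        Item={
            "id": "bfk1r3sdvx5937df",
            "record_data": {"key_0": "value_0"},
            "GSI1PK": str(random.randint(1, NUM_SHARDS)),
            "GSI1SK": datetime.now(timezone.utc).isoformat(),
        }
    )

    # Read path: query every shard for the time window and merge by timestamp.
    def query_by_time(start_iso, end_iso):
        items = []
        for shard in range(1, NUM_SHARDS + 1):
            response = table.query(
                IndexName="GSI1",  # hypothetical index name
                KeyConditionExpression=Key("GSI1PK").eq(str(shard))
                & Key("GSI1SK").between(start_iso, end_iso),
            )
            items.extend(response["Items"])
        return sorted(items, key=lambda item: item["GSI1SK"])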