php, amazon-dynamodb, provisioned-iops

Purge old data in DynamoDB by rate limiting deletes in PHP


I have a dataset in DynamoDB whose primary key is the user ID, and a timestamp is one of the data attributes. I want to run a purge on this table that deletes every item whose timestamp is older than one week.

I do not want to eat up all the write capacity units. Ideally I would want a rate-limited delete operation (in PHP); otherwise, for a dataset that is tens of GBs in size, the purge will starve other writes.

I was wondering whether a global secondary index on timestamp (+ user ID) would help reduce the rows to be scanned. But again, I would not want to thrash the table so hard that other writes start failing.

Can someone provide rate-limited insert/delete example code and references for this in PHP?


Solution

  • You can create a global secondary index:

    timestampHash (number, between 1 and 100)
    timestamp (number)
    

    Whenever you create or update your timestamp, also set the timestampHash attribute to a random number between 1 and 100. This distributes the items evenly across the index. You need this hash because a range query on a GSI requires an equality condition on the index's hash key. Querying by user ID and timestamp doesn't make sense, because that would only return one item each time and you would have to loop over all your users (assuming there is one item per user ID).
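
    With the AWS SDK for PHP v3 the hash can be assigned at write time. A minimal sketch, assuming a table named MyTable with userId as its primary key (both names are placeholders for your own schema):

    require 'vendor/autoload.php';

    use Aws\DynamoDb\DynamoDbClient;

    $client = new DynamoDbClient(['region' => 'us-east-1', 'version' => 'latest']);

    // Hypothetical table/attribute names; timestampHash is the random GSI hash key.
    $client->putItem([
        'TableName' => 'MyTable',
        'Item' => [
            'userId'        => ['S' => 'user-123'],
            'timestamp'     => ['N' => (string) time()],
            'timestampHash' => ['N' => (string) rand(1, 100)], // random bucket 1..100
        ],
    ]);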

    Then you can run a purger that issues one query per timestampHash value (100 queries in total), fetching all items with a timestamp older than one week. Between each full pass you can wait 5 minutes, or however long you think is appropriate, depending on the number of items you need to purge.

    You can use BatchWriteItem to delete up to 25 items per request, so the deletes go out in batches instead of one call per item.

    In pseudocode it looks like this:

    while (true) {
        for (int i = 1; i <= 100; i++) {
            records = dynamo.query(timestampHash = i, timestamp < Date.now());
            dynamo.batchWriteItem(records, DELETE);
        }
        sleep(5 minutes);
    }
    

    You can also catch ProvisionedThroughputExceededException and do an exponential back-off, so that if you do exceed the throughput you stop and wait until your provisioned capacity recovers.
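
    Putting it together in PHP with the AWS SDK v3, a minimal sketch of the purger might look like the following. The table name (MyTable), index name (timestampHash-timestamp-index) and key attribute names are assumptions; adjust them to your schema, and note that query pagination (LastEvaluatedKey) is omitted for brevity:

    require 'vendor/autoload.php';

    use Aws\DynamoDb\DynamoDbClient;
    use Aws\DynamoDb\Exception\DynamoDbException;

    $client = new DynamoDbClient(['region' => 'us-east-1', 'version' => 'latest']);

    $table  = 'MyTable';                        // hypothetical table name
    $index  = 'timestampHash-timestamp-index';  // hypothetical GSI name
    $cutoff = time() - 7 * 24 * 3600;           // one week ago

    while (true) {
        for ($hash = 1; $hash <= 100; $hash++) {
            // Query one hash bucket for items older than the cutoff.
            // 'timestamp' is a DynamoDB reserved word, hence the #ts alias.
            $result = $client->query([
                'TableName'                 => $table,
                'IndexName'                 => $index,
                'KeyConditionExpression'    => 'timestampHash = :h AND #ts < :cutoff',
                'ExpressionAttributeNames'  => ['#ts' => 'timestamp'],
                'ExpressionAttributeValues' => [
                    ':h'      => ['N' => (string) $hash],
                    ':cutoff' => ['N' => (string) $cutoff],
                ],
            ]);

            // Delete in chunks of 25, the BatchWriteItem limit.
            foreach (array_chunk($result['Items'], 25) as $chunk) {
                $requests = [];
                foreach ($chunk as $item) {
                    // userId is assumed to be the base table's only key attribute.
                    $requests[] = ['DeleteRequest' => ['Key' => ['userId' => $item['userId']]]];
                }

                $attempt = 0;
                do {
                    try {
                        $response = $client->batchWriteItem(['RequestItems' => [$table => $requests]]);
                        // Retry whatever DynamoDB could not process this round.
                        $requests = $response['UnprocessedItems'][$table] ?? [];
                    } catch (DynamoDbException $e) {
                        if ($e->getAwsErrorCode() !== 'ProvisionedThroughputExceededException') {
                            throw $e;
                        }
                        // Throughput exceeded: keep $requests and back off below.
                    }
                    if ($requests) {
                        sleep(min(60, 2 ** $attempt++)); // exponential back-off, capped at 60s
                    }
                } while ($requests);
            }
        }
        sleep(300); // wait 5 minutes between full passes
    }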


    Another way is to structure your tables by time:

    TABLE_08292016
    TABLE_09052016
    TABLE_09122016
    

    All your data for the week of 08/28/2016 will go into TABLE_08292016. Then at the end of every week you can simply drop the oldest table, which removes all of its items without consuming any write throughput.
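
    A rough sketch of that rotation in PHP (the TABLE_mmddyyyy naming scheme comes from above; keeping two weeks of data is just an assumed retention policy):

    require 'vendor/autoload.php';

    use Aws\DynamoDb\DynamoDbClient;

    $client = new DynamoDbClient(['region' => 'us-east-1', 'version' => 'latest']);

    // Name tables after the Monday of their week, e.g. TABLE_08292016.
    $monday       = strtotime('monday this week');
    $currentTable = 'TABLE_' . date('mdY', $monday);
    $oldTable     = 'TABLE_' . date('mdY', strtotime('-2 weeks', $monday));

    // Write new items into $currentTable; dropping the old table deletes
    // all of its data without consuming any write capacity.
    $client->deleteTable(['TableName' => $oldTable]);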