Tags: caching, amazon-dynamodb, amazon-dynamodb-dax

Best way to cache large data that is added to (DynamoDB)


I am currently working with large amounts of data that I'm storing in DynamoDB. Once data enters the database it never changes, but new data flows into the database consistently. My question is: how can I cache this data (using DAX if possible) to limit how often I have to query the database directly?

For example, if I want the data from 10:00 AM to 11:00 AM then I can query with the parameters of:

start_time = 10:00 AM, end_time = 11:00 AM

The response from this query will be cached in DAX for later use. My problem is that when I then go to get data between 10:00 AM and 1:00 PM, I have to query for data that is already in my cache (because the caching is based on the query parameters, and the new request uses different parameters).

My first thought was to cache the data in small sections and just make many queries. For example:

Request the 10 - 10:15 AM data and cache it, then request the 10:15 - 10:30 AM data and cache it, and so on. This way I make many smaller queries, but I won't have overlapping data in my cache. Is this the best approach, or should I cache the overlapping data? Any help is appreciated.


Solution

  • If I understood correctly:

    start_time = 10:00 AM, end_time = 11:00 AM ( Cache has no data, hits DynamoDB )
    start_time = 10:00 AM, end_time = 11:00 AM ( Cache has this data, doesn't hit DynamoDB )
    start_time = 10:00 AM, end_time = 10:30 AM ( Difference in cache keys, hits DynamoDB )
    

    Basically, you could have the full set of data in the cache, but unless a request uses the exact same cache key (which is what produces a cache hit), the cache can never intelligently return a "subset" of that full data.
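
    To make the key mismatch concrete, here is a minimal sketch in plain Python. The `cache` dict and `query` function are hypothetical stand-ins for DAX's parameter-keyed query cache, not real DAX API calls:

```python
# A query cache keyed on the exact (start, end) parameter pair, the way
# DAX's query cache effectively behaves. All names here are illustrative.

cache = {}

def query(start_time, end_time):
    """Return cached rows for this exact parameter pair, else 'hit' the DB."""
    key = (start_time, end_time)
    if key in cache:
        return cache[key], "cache hit"
    result = f"rows for {start_time}-{end_time}"  # stand-in for a DynamoDB Query
    cache[key] = result
    return result, "hit DynamoDB"

print(query("10:00", "11:00")[1])  # hit DynamoDB
print(query("10:00", "11:00")[1])  # cache hit
print(query("10:00", "10:30")[1])  # hit DynamoDB -- different key, even though
                                   # these rows are a subset of what is cached
```

    The third call misses even though every row it needs is already cached, because the cache only compares keys, never row contents.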

    DynamoDB DAX Item Cache

    DynamoDB DAX brings along an Item Cache, where individual items are stored in and returned from DAX. However, the Item Cache applies only to GetItem and BatchGetItem operations.

    https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DAX.concepts.html#DAX.concepts.item-cache

    Fragmenting DDB Query

    If DynamoDB DAX is not possible, or if Query and Scan operations are needed, then the next best, least invasive technique is to fragment / partition the DDB query into "smaller" queries so that they result in more cache hits.

    e.g.

    start_time = 10:00 AM, end_time = 10:15 AM
    start_time = 10:15 AM, end_time = 10:30 AM
    start_time = 10:30 AM, end_time = 10:45 AM
    

    There are a few good third-party application libraries you can use to partition your query keys, and you can choose the granularity (from 15-minute blocks down to 1-minute or even 1-second blocks) to suit your performance needs.
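
    The fragmentation above can be sketched without any library: split a requested range into block-aligned sub-ranges, so that any two requests covering the same period generate identical cache keys. A minimal sketch (plain Python, hypothetical function names, assuming naive datetimes):

```python
from datetime import datetime, timedelta

BLOCK = timedelta(minutes=15)

def align_down(ts, block=BLOCK):
    """Round a datetime down to the start of its block."""
    seconds = (ts - ts.min).total_seconds()
    return ts - timedelta(seconds=seconds % block.total_seconds())

def fragment(start, end, block=BLOCK):
    """Split [start, end) into block-aligned sub-ranges usable as cache keys."""
    cur = align_down(start, block)
    out = []
    while cur < end:
        out.append((cur, cur + block))
        cur += block
    return out

# A 10:00 - 11:00 request becomes four 15-minute fragments
for s, e in fragment(datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 11, 0)):
    print(s.time(), "-", e.time())
```

    A later 10:00 AM - 1:00 PM request regenerates the exact same four keys for its first hour, so those fragments hit the cache and only the 11:00 AM - 1:00 PM fragments actually reach DynamoDB.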

    This technique is not without cons, however: the additional number of hops / queries it now makes must be taken into consideration.

    Application ORM

    Solving problems like these is what application ORMs are really good at, for example Hibernate in the case of Java development. (Last I checked, Hibernate doesn't have support for DynamoDB quite yet, although it is possible to extend it and build custom strategies.)

    You could check whether your application ORM has support for DynamoDB:

    https://www.baeldung.com/hibernate-second-level-cache
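
    Absent ORM support, the core idea of a second-level cache (entities cached by primary key, transparently to callers) can be hand-rolled. A minimal sketch in plain Python; the class and loader here are hypothetical, and in practice the loader would wrap a DynamoDB GetItem call:

```python
class SecondLevelCache:
    """Entity cache keyed by primary key, like an ORM second-level cache."""

    def __init__(self, loader):
        self._loader = loader  # fetches one entity from the database
        self._store = {}
        self.db_hits = 0       # for observing cache effectiveness

    def get(self, entity_id):
        """Return the entity, loading from the database only on a miss."""
        if entity_id not in self._store:
            self.db_hits += 1
            self._store[entity_id] = self._loader(entity_id)
        return self._store[entity_id]

# Usage: two distinct ids are fetched, the repeat read is served from cache
cache = SecondLevelCache(lambda eid: {"id": eid, "payload": f"row-{eid}"})
cache.get(1)
cache.get(1)
cache.get(2)
print(cache.db_hits)  # 2
```

    Because items in this use case never change once written, such a cache never needs invalidation, which makes this pattern a particularly good fit here.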