Search code examples
amazon-dynamodbdynamodb-queries

How to design a recommendation system in DynamoDb based on likes


Considering performance as main importance, how would be the best design approach to follow in order to build a recommendation system in DynamoDb?

The system would need to store an url and the numbers of times that topic was 'liked', but my requirement includes the need of searches by daily, weekly, monthly and yearly, for e.g.:

Give me the top 10 of the week
Give me the top 10 of the month

I was thinking about to include the date and time information, so that the query could control it through this field, but not sure wether it is the good in terms of performance.


Solution

  • If the only data structure you had was a hash map, how would you solve this problem?

    What if on top of that constraint, you could only update any key up to 1000 times per second, and read a key up to 3000 per second?

    How often do you expect your items to get liked? Presumably there will be some that will be hot and liked a lot, while others would almost never get any likes.

    How real_time does your system need to be? Can the system be eventually consistent (meaning, would it be ok if you only reported likes as of several minutes ago)?


    Let's give this a shot
    Disclaimer: this is very much a didactic exercise -- in practice you may want to explore an analytics product, or some other technologies than DynamoDB to accompish this task


    Part 1. Representing an Item And Updating Like Counts

    First, let's talk about your aggregation/analytics goals: you mentioned that you want to query for "top 10 of the week" or "top 10 of the month" but you didn't specify if that is supposed to mean "calendar week"/"calendar month", or "last 7 days"/"last 30 days".

    I'm going to take it literally and assume that "top 10 of the week" means top 10 items from this week that started on the most recent Monday (or Sunday if you roll that way). Same for month: "top 10 of the month" means "top 10 items since the beginning of this month.

    In this case, you will probably want to store, for each item:

    • a count of total all-time likes
    • a count of likes since the beginning of current month
    • a count of likes since the beginning of current week
    • current month number - needed to determine if we need to reset
    • current week number - needed to determine if we need to reset

    And each week, reset the counts for the current week; And each month reset the counts for the current month.

    In DynamoDB, this might be represented like so:

    {
       id: "<item-id>",
       likes_all: <numeric>,   // total likes of all time
       likes_wk: <numeric>,    // total likes for the current week
       likes_mo: <numeric>,    // total likes for the current month
       curr_wk: <numeric>,     // number of the current week of year, eg. 27
       curr_mo: <numeric>,     // number of the current month of year, eg. 6
    }
    

    Now, you can update the number of likes with an UpdateItem operation, with an UpdateExpression, like so:

    dynamodb update-item \
        --table-name <your-table-name> \
        --key '{"id":{"S":"<item-id>"}}' \
        --update-expression "SET likes_all = likes_all + :lc, likes_wk = likes_wk + :lc, likes_mo = likes_mo + :lc" \
        --expression-attribute-values '{":lc": {"N":"1"}}' \
        --return-values ALL_NEW
    

    This gives you a simple atomic way to increment the counts and get back the updated values. Notice the :lc value can be any number (not just 1). This will come in handy below.

    But there's a catch. You also need to be able to reset the counts if the week or month rolled over, so to do that, you can break the update into two operations:

    1. update the total count (and get the most recent values back)
    2. conditionally update the week and month counts

    So, our update sequence becomes:

    Step 1. update total count and read back the updated item:

    dynamodb update-item \
        --table-name <your-table-name> \
        --key '{"id":{"S":"<item-id>"}}' \
        --update-expression "SET likes_all = likes_all + :lc" \
        --expression-attribute-values '{":lc": {"N":"1"}}' \
        --return-values ALL_NEW
    

    This updates the total count and gives us back the state of the item. Based on the values of the curr_wk and curr_mo, you will have to decide what the update looks like. You may be either incrementing, or setting an absolute value. Let's say we're in the case when the update is being performed after the week rolled over, but not the month. And let's say that the result of the update above looks like this:

    {
       id: "<item-id>",
       likes_all: 1000,  // total likes of all time
       likes_wk: 70,     // total likes for the current week
       likes_mo: 150,    // total likes for the current month
       curr_wk: 26,      // number of the week of last update
       curr_mo: 6,       // number of the month of year of last update
    }
    

    curr_wk is 6, but at the time of update, the actual current week should be 7.

    Then your update query would look look like this:

    dynamodb update-item \
        --table-name <your-table-name> \
        --key '{"id":{"S":"<item-id>"}}' \
        --update-expression "SET curr_wk = 27, likes_wk = :lc, likes_mo = likes_mo + :lc" \
        --condition-expression "curr_wk = :wk AND curr_mo = :mo" \
        --expression-attribute-values '{":lc": {"N":"1"}, ":wk": {"N":"26"}, ":lc": {"N":"6"},}' \
        --return-values ALL_NEW
    

    The ConditionExpression ensures that we don't reset the likes twice, if two conflicting updates happen at the same time. In that case, one of the updates would fail and you'd have to switch the update back to an increment.


    Part 2 - Keeping Track of Statistics

    To take care of your statistics you need to keep track of most likes per week and per month.

    You can keep a sorted list of hottest items per week and per month. You also can store these lists in Dynamo.

    For example, let's say you want to keep track of top 3. You might store something like:

    {
       id: "item-stats",
       week_top: ["item3:4000", "item2:2000", "item9:700"],
       month_top: ["item2:100000", "item4:50000", "item3:12000"],
       curr_wk: 26,
       curr_mo: 6,
       sequence: <optimistic-lock-token>
    }
    

    Whenever you perform an update for items, you would also update the statistics.

    The algorithm for updating statistics will be similar to updating an item, except you can't just use update expressions. Instead you have to implement your own read-modify-write sequence using GetItem, PutItem and ConditionExpression.

    First, you read the current values for the item-stats special item, including the value of the current sequence (this is important to detect clobbering)

    Then, you figure out if the item(s) you've just updated counts for would make it into the Top-N weekly or monthly list. If so, you would update the week_top and/or month-top attributes and prepare a conditional PutItem request.

    The PutItem request must include a conditional check that verifies the sequeuce of the item-stats is the same as what you read earlier. If not, you need to read the item again and re-compute the top-N lists, then attempt to put again.

    Also, similar to the way the counts get reset for items, when an update happens you need to check and see if the weekly or monthly top needs to be reset as part of the update.

    When you make the PutItem request, make sure to generate a new sequence value.


    Part 3 - Putting It All Together

    In Part 1 and Part 2 we figured out how to keep track of likes and keep track of statics but there are big problems with our approach: performance would be pretty bad with any kind of real-life scale; hot items would create problems for us; updating the Top-N stats would be a significant bottleneck.

    To improve performance and achieve some scalability we'd want to get away from updating each item and the item-stats for every single "like".

    We can achieve a good balance of performance and scalability using a combination of queues + dynamodb + compute resource.

    1. create a queue to store pending likes
    2. let "likes API" would enqueue a message tagging a post with a like, instead of applying them as they come
    3. implement a queue consumer (could be a Lambda, or some other periodically running process) to pull messages off the queue and aggregate likes per item, then update items and the item-stats

    By batching updates, we can get control over concurrency (and cost) at the expense of latency/eventual consistency.

    We may end up with a limited number of queue consumers, each processing items in batches. In each batch, multiple item likes would be aggregated and a single update per item would be applied. Similarly, a single item-stats update would be applied per batch processor.

    Depending on volume of incoming likes, you may need to spin up more processors.