Considering performance as main importance, how would be the best design approach to follow in order to build a recommendation system in DynamoDb?
The system would need to store an url and the numbers of times that topic was 'liked', but my requirement includes the need of searches by daily, weekly, monthly and yearly, for e.g.:
Give me the top 10 of the week
Give me the top 10 of the month
I was thinking about to include the date and time information, so that the query could control it through this field, but not sure wether it is the good in terms of performance.
If the only data structure you had was a hash map, how would you solve this problem?
What if on top of that constraint, you could only update any key up to 1000 times per second, and read a key up to 3000 per second?
How often do you expect your items to get liked? Presumably there will be some that will be hot and liked a lot, while others would almost never get any likes.
How real_time does your system need to be? Can the system be eventually consistent (meaning, would it be ok if you only reported likes as of several minutes ago)?
Let's give this a shot
Disclaimer: this is very much a didactic exercise -- in practice you may want to explore an analytics product, or some other technologies than DynamoDB to accompish this task
Part 1. Representing an Item And Updating Like Counts
First, let's talk about your aggregation/analytics goals: you mentioned that you want to query for "top 10 of the week" or "top 10 of the month" but you didn't specify if that is supposed to mean "calendar week"/"calendar month", or "last 7 days"/"last 30 days".
I'm going to take it literally and assume that "top 10 of the week" means top 10 items from this week that started on the most recent Monday (or Sunday if you roll that way). Same for month: "top 10 of the month" means "top 10 items since the beginning of this month.
In this case, you will probably want to store, for each item:
And each week, reset the counts for the current week; And each month reset the counts for the current month.
In DynamoDB, this might be represented like so:
{
id: "<item-id>",
likes_all: <numeric>, // total likes of all time
likes_wk: <numeric>, // total likes for the current week
likes_mo: <numeric>, // total likes for the current month
curr_wk: <numeric>, // number of the current week of year, eg. 27
curr_mo: <numeric>, // number of the current month of year, eg. 6
}
Now, you can update the number of likes with an UpdateItem operation, with an UpdateExpression, like so:
dynamodb update-item \
--table-name <your-table-name> \
--key '{"id":{"S":"<item-id>"}}' \
--update-expression "SET likes_all = likes_all + :lc, likes_wk = likes_wk + :lc, likes_mo = likes_mo + :lc" \
--expression-attribute-values '{":lc": {"N":"1"}}' \
--return-values ALL_NEW
This gives you a simple atomic way to increment the counts and get back the updated values. Notice the :lc
value can be any number (not just 1). This will come in handy below.
But there's a catch. You also need to be able to reset the counts if the week or month rolled over, so to do that, you can break the update into two operations:
So, our update sequence becomes:
Step 1. update total count and read back the updated item:
dynamodb update-item \
--table-name <your-table-name> \
--key '{"id":{"S":"<item-id>"}}' \
--update-expression "SET likes_all = likes_all + :lc" \
--expression-attribute-values '{":lc": {"N":"1"}}' \
--return-values ALL_NEW
This updates the total count and gives us back the state of the item. Based on the values of the curr_wk and curr_mo, you will have to decide what the update looks like. You may be either incrementing, or setting an absolute value. Let's say we're in the case when the update is being performed after the week rolled over, but not the month. And let's say that the result of the update above looks like this:
{
id: "<item-id>",
likes_all: 1000, // total likes of all time
likes_wk: 70, // total likes for the current week
likes_mo: 150, // total likes for the current month
curr_wk: 26, // number of the week of last update
curr_mo: 6, // number of the month of year of last update
}
curr_wk
is 6, but at the time of update, the actual current week should be 7.
Then your update query would look look like this:
dynamodb update-item \
--table-name <your-table-name> \
--key '{"id":{"S":"<item-id>"}}' \
--update-expression "SET curr_wk = 27, likes_wk = :lc, likes_mo = likes_mo + :lc" \
--condition-expression "curr_wk = :wk AND curr_mo = :mo" \
--expression-attribute-values '{":lc": {"N":"1"}, ":wk": {"N":"26"}, ":lc": {"N":"6"},}' \
--return-values ALL_NEW
The ConditionExpression ensures that we don't reset the likes twice, if two conflicting updates happen at the same time. In that case, one of the updates would fail and you'd have to switch the update back to an increment.
Part 2 - Keeping Track of Statistics
To take care of your statistics you need to keep track of most likes per week and per month.
You can keep a sorted list of hottest items per week and per month. You also can store these lists in Dynamo.
For example, let's say you want to keep track of top 3. You might store something like:
{
id: "item-stats",
week_top: ["item3:4000", "item2:2000", "item9:700"],
month_top: ["item2:100000", "item4:50000", "item3:12000"],
curr_wk: 26,
curr_mo: 6,
sequence: <optimistic-lock-token>
}
Whenever you perform an update for items, you would also update the statistics.
The algorithm for updating statistics will be similar to updating an item, except you can't just use update expressions. Instead you have to implement your own read-modify-write sequence using GetItem, PutItem and ConditionExpression.
First, you read the current values for the item-stats
special item, including the value of the current sequence
(this is important to detect clobbering)
Then, you figure out if the item(s) you've just updated counts for would make it into the Top-N weekly or monthly list. If so, you would update the week_top
and/or month-top
attributes and prepare a conditional PutItem request.
The PutItem request must include a conditional check that verifies the sequeuce
of the item-stats
is the same as what you read earlier. If not, you need to read the item again and re-compute the top-N lists, then attempt to put again.
Also, similar to the way the counts get reset for items, when an update happens you need to check and see if the weekly or monthly top needs to be reset as part of the update.
When you make the PutItem request, make sure to generate a new sequence
value.
Part 3 - Putting It All Together
In Part 1 and Part 2 we figured out how to keep track of likes and keep track of statics but there are big problems with our approach: performance would be pretty bad with any kind of real-life scale; hot items would create problems for us; updating the Top-N stats would be a significant bottleneck.
To improve performance and achieve some scalability we'd want to get away from updating each item and the item-stats for every single "like".
We can achieve a good balance of performance and scalability using a combination of queues + dynamodb + compute resource.
item-stats
By batching updates, we can get control over concurrency (and cost) at the expense of latency/eventual consistency.
We may end up with a limited number of queue consumers, each processing items in batches. In each batch, multiple item likes would be aggregated and a single update per item would be applied. Similarly, a single item-stats
update would be applied per batch processor.
Depending on volume of incoming likes, you may need to spin up more processors.