I have Kafka streams containing interactions of users with a website, so every event has a timestamp and information about the event. For each user I want to store the last K events in Cassandra (e.g. 100 events).
Our website is constantly experiencing bot / heavy users that is why we want to cap events, just to consider "normal" users.
I currently have the current data model in Cassandra:
user_id, event_type, timestamp, event_blob
where
<user_id, event_type> = partition key, timestamp = clustering key
For now we write a new record in Cassandra as soon as a new event happens and later on we go and clean up "heavier" partitions (i.e. count of events > 100). This doesn't happen in real time and until we don't clean up the heavy partitions we sometimes get bad latencies when reading.
Do you have any suggestions of a better table design for such case? Is there a way to tell Cassandra to store only at most K elements for partition and expire the old ones in a FIFO way? Or is there a better table design that I can opt for?
Do you have any suggestions of a better table design for such case?
When data modeling for scenarios like this, I recommend a pattern that makes use of three things:
TTL:
later on we go and clean up "heavier" partitions
How long (on average) before the cleanup happens? One thing I would do, is to use a TTL on that table set to somewhere around the maximum amount of time before your team usually has to clean them up.
Clustering Key, Descending Order:
So your PRIMARY KEY definition looks like this:
PRIMARY KEY ((user_id,event_type),timestamp)
Make sure that you're also clustering in a descending order on timestamp.
WITH CLUSTERING ORDER BY (timestamp DESC)
This is important to use in conjunction with your TTL. Here, your tombstones are on the "bottom" of the partition (when sorting on timestamp
descinding) and the recent data (the data you care about) is at the "top" of the partition.
Range Query:
Finally, make sure your query has a range component on the timestamp
.
For example: if today is the 11th, and my TTL is 5 days, I can then query the last 4 days of data without pulling back tombstones:
SELECT * FROM events
WHERE user_id = 11111 AND event_type = 'B'
AND timestamp > '2020-03-07 00:00:00';