Search code examples
cassandrascylla

Timeseries log data with levels in ScyllaDb/Cassandra


I need a way to query log data by level (think log,debug,error) and by timestamp. So that I can retrieve information after a provided time given a level, set of levels or by all levels.

Proposed solution:

CREATE KEYSPACE my_keyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
AND durable_writes = true
AND partitioner = 'org.apache.cassandra.dht.ByteOrderedPartitioner';

CREATE TABLE my_keyspace.logs (
    timestamp TIMESTAMP,     // Partition key
    level TEXT,              // Clustering key
    source_node_id TEXT,     // Clustering key to handle collisions
    message TEXT,            // Log message content
    PRIMARY KEY ((timestamp), level, source_node_id) 
) WITH CLUSTERING ORDER BY (level ASC, source_node_id ASC);

SELECT * FROM logs 
WHERE TOKEN(timestamp) > TOKEN('2024-10-01T00:00:00Z') 
AND level IN ('error', 'warn');

Here I rely on ByteOrderedPartitioning to ensure the data is well ordered by timestamp, such that I can receive the data in an ordered fashion. I use a clustering key for the level so that I can use the IN clause. Although I'm not totally sure this is possible to use the IN clause after a TOKEN greater than. And I use a "source_node_id" to ensure I do not overwrite logging data if it happens at the same time on a different node, each logging entity will have a unique identifier.

Potential issues identified: due to the ByteOrdering if the amount of logging is very uneven over time then that will affect how evenly the data is distributed over the node, although I think I can live with this. So long as over a long period of time data acculumates largely evenly around the token range with the exception of occasional hotspots. We can assume that in normal operation the amount of logging will be normally distributed.

I will be using the latest versions of Cassandra or ScyllaDB to implement this logging service and don't care about older versions behaviour.

I'm wondering A if this is a effective solution, B can an IN clause be used after than a TOKEN greater than clause and C could better use of indicies help with such a problem and provided a better solution?


Solution

  • The ByteOrderedPartitioner is not supported by ScyllaDB, and the reason why it's not supported is exactly what you asked about: It results in very skewed data distributions. So you shouldn't build your solution on that partitioner.

    The better data modeling for size time-series is to use the clustering key, not the partition key, to sort by time. But, you don't want to have all the data in a single partition so what you often do to split the partitions into buckets of, say, individual days (i.e., the day number is the partition key), and inside each partition the data is sorted by the clustering key - the accurate time. In this approach the data is well-balanced of the disk, but the write activity is still only to a single partition and therefore just RF (e.g., 3) CPUs of the cluster for an entire day, so it makes sense to add to the partition key another component so that new activity will be spread across, say, 10 partitions instead of one (and of course, read will need to be done from these 10 partitions). You can make the "level" part of the clustering key or the partition key, as you choose - if a common need is to scan, for example, just the ERROR level events, it is more efficient to make it part of the partition key - this allows you to efficiently scan just the ERROR-level events of a single day without needed to filter out other types of events.

    Time-series modeling with ScyllaDB or Cassandra is quite popular, so you can easily find in a Google search various documents and presentations on how to do it effectively - how to model the data, which compaction strategy to use (time-window compaction strategy), and so on.