We have a requirement to record time series data in cassandra
Below data modeling is able to meet the functional requirement.
CREATE TABLE sample_times (c timeuuid ,a varchar, PRIMARY KEY (c));
INSERT INTO sample_times (c,a) VALUES ( now(), 'course1');
INSERT INTO sample_times (c,a) VALUES ( now(), 'course2');
INSERT INTO sample_times (c,a) VALUES ( now(), 'course3');
INSERT INTO sample_times (c,a) VALUES ( now(), 'course0');
INSERT INTO sample_times (c,a) VALUES ( now(), 'course5');
INSERT INTO sample_times (c,a) VALUES ( now(), 'course4');
INSERT INTO sample_times (c,a) VALUES ( now(), 'course0');
to select the data out, we can select it by a small range that only limit number of lines can be found in that range(i.e, 5 minutes).
SELECT * FROM sample_times
WHERE c > maxTimeuuid('2023-05-03 00:05+0000')
AND c < minTimeuuid('2023-05-03 00:10+0000') allow filtering;
But every time we perform this, we have to add an allow filtering that means the clause may have performance impact. we do not want to add some other primary key like workstation_id etc to group the time series data since it is not convenient.
I don't know how cassandra handle this in implementation. Does it practical in performance to use time uuid as first primary key in a large table?
Thanks~
expect the result can be selected quickly in a large table by providing just small time range.
The question isn't whether timeUUIDs are a good choice for partition keys. The question is whether it is a good idea to read data from Cassandra using range queries on the partition key and answer is no.
You choose Cassandra because you have a scale problem. To deal with your scale problem, you have to optimise the reads by clustering rows of data into partitions so that you only need to read ONE partition to get all the data you need.
Range queries on partition keys do not scale because by definition, Cassandra has to send multiple request to multiple nodes to retrieve each partition required to satisfy the range -- it's a scatter/gather access pattern.
You haven't posted a lot of detail about your use case but what you have doesn't look like time-series to me. You don't typically store data about time in a time-series -- you store data about some entity that has a time component to it.
For example, here's a data model for devices that measure temperature:
CREATE TABLE devices (
device_id text,
time_recorded timestamp,
temperature int,
PRIMARY KEY (device_id, time_recorded)
) WITH CLUSTERING ORDER BY (time_recorded DESC)
You can then get the range of temperatures recorded by a device with:
SELECT temperature FROM devices
WHERE device_id = ?
AND time_recorded > ?
AND time_recorded < ?
With this model, you are just retrieving the data from one partition so it would be very fast. Cheers!