I need to find the rows with given stationid, that have time1 greater than specified time and maximal time2.
The table is created like this:
CREATE TABLE forec (
stationid int,
time1 timestamp,
time2 timestamp,
value double,
PRIMARY KEY ((stationid), time1, time2)
) WITH CLUSTERING ORDER BY (time1 DESC)
Lets suppose the data in the table is like this:
+------------+-----------------------+----------------------+--------+
| stationid | time1 | time2 | value |
+------------+-----------------------+----------------------+--------+
| 1 | 2020-10-21 06:00:00 | 2020-10-21 05:00:00 | 1 |
| 1 | 2020-10-21 06:00:00 | 2020-10-21 04:00:00 | 2 |
| 1 | 2020-10-21 06:00:00 | 2020-10-21 03:00:00 | 3 |
| 1 | 2020-10-21 05:00:00 | 2020-10-21 04:00:00 | 4 |
| 1 | 2020-10-21 05:00:00 | 2020-10-21 03:00:00 | 5 |
| 1 | 2020-10-21 04:00:00 | 2020-10-21 02:00:00 | 6 |
+------------+-----------------------+----------------------+--------+
I would like to query: Give me all the rows where stationid = 1 and time1 >= 2020-10-21 05:00:00 and time2 has maximum value. The query should return the following rows:
+------------+-----------------------+----------------------+--------+
| stationid | time1 | time2 | value |
+------------+-----------------------+----------------------+--------+
| 1 | 2020-10-21 06:00:00 | 2020-10-21 05:00:00 | 1 |
| 1 | 2020-10-21 05:00:00 | 2020-10-21 04:00:00 | 4 |
+------------+-----------------------+----------------------+--------+
I know I could query like:
SELECT * FROM forec WHERE stationid = 1 AND time1 >= '2020-10-21 05:00:00';
and then filter the results on client (and keep only the rows that have maximum time2), however I would like to know if this can be done more efficiently (filtering results on the Cassandra side).
Or maybe I should change the table model?
Edit: According to Cassandra document, "If a column is selected without an aggregate function, in a statement with a GROUP BY, the first value encounter in each group will be returned." So below example only works when time2
is stored in DESC
order.
If you are using recent version of Cassandra (like 3.11.x) then you can use GROUP BY
to do things like
SELECT
stationid,
time1,
max(time2) AS max_time2,
value
FROM
forec
WHERE
stationid = 1
AND
time1 >= '2020-10-21 05:00:00'
GROUP BY time1;
and you get
cqlsh:test> SELECT stationid, time1, max(time2) as max_time2, value FROM forec WHERE stationid = 1 AND time1 >= '2020-10-21 05:00:00' GROUP BY time1;
stationid | time1 | max_time2 | value
-----------+---------------------------------+---------------------------------+-------
1 | 2020-10-21 06:00:00.000000+0000 | 2020-10-21 05:00:00.000000+0000 | 1
1 | 2020-10-21 05:00:00.000000+0000 | 2020-10-21 04:00:00.000000+0000 | 4
(2 rows)
Note that this scans your partition so be aware of partition size, especially when you use timestamp in clustering column.