Search code examples
cassandracql

CQL (cassandra) - Select only the rows with maximum value in one of the columns


I need to find the rows with given stationid, that have time1 greater than specified time and maximal time2.

The table is created like this:

CREATE TABLE forec (
    stationid int,
    time1 timestamp,
    time2 timestamp,
    value double,
    PRIMARY KEY ((stationid), time1, time2)
) WITH CLUSTERING ORDER BY (time1 DESC)

Lets suppose the data in the table is like this:

    +------------+-----------------------+----------------------+--------+
    | stationid  | time1                 |  time2               |  value |
    +------------+-----------------------+----------------------+--------+
    | 1          | 2020-10-21 06:00:00   | 2020-10-21 05:00:00  | 1      |                                  
    | 1          | 2020-10-21 06:00:00   | 2020-10-21 04:00:00  | 2      |                                   
    | 1          | 2020-10-21 06:00:00   | 2020-10-21 03:00:00  | 3      |                                   
    | 1          | 2020-10-21 05:00:00   | 2020-10-21 04:00:00  | 4      |
    | 1          | 2020-10-21 05:00:00   | 2020-10-21 03:00:00  | 5      |
    | 1          | 2020-10-21 04:00:00   | 2020-10-21 02:00:00  | 6      |
    +------------+-----------------------+----------------------+--------+

I would like to query: Give me all the rows where stationid = 1 and time1 >= 2020-10-21 05:00:00 and time2 has maximum value. The query should return the following rows:

    +------------+-----------------------+----------------------+--------+
    | stationid  | time1                 |  time2               |  value |
    +------------+-----------------------+----------------------+--------+
    | 1          | 2020-10-21 06:00:00   | 2020-10-21 05:00:00  | 1      |        
    | 1          | 2020-10-21 05:00:00   | 2020-10-21 04:00:00  | 4      | 
    +------------+-----------------------+----------------------+--------+

I know I could query like:

SELECT * FROM forec WHERE stationid = 1 AND time1 >= '2020-10-21 05:00:00';

and then filter the results on client (and keep only the rows that have maximum time2), however I would like to know if this can be done more efficiently (filtering results on the Cassandra side).

Or maybe I should change the table model?


Solution

  • Edit: According to Cassandra document, "If a column is selected without an aggregate function, in a statement with a GROUP BY, the first value encounter in each group will be returned." So below example only works when time2 is stored in DESC order.

    If you are using recent version of Cassandra (like 3.11.x) then you can use GROUP BY to do things like

    SELECT
      stationid,
      time1,
      max(time2) AS max_time2,
      value
    FROM
      forec
    WHERE
      stationid = 1
    AND
      time1 >= '2020-10-21 05:00:00'
    GROUP BY time1;
    

    and you get

    cqlsh:test> SELECT stationid, time1, max(time2) as max_time2, value FROM forec WHERE stationid = 1 AND time1 >= '2020-10-21 05:00:00' GROUP BY  time1;
    
     stationid | time1                           | max_time2                       | value
    -----------+---------------------------------+---------------------------------+-------
             1 | 2020-10-21 06:00:00.000000+0000 | 2020-10-21 05:00:00.000000+0000 |     1
             1 | 2020-10-21 05:00:00.000000+0000 | 2020-10-21 04:00:00.000000+0000 |     4
    
    (2 rows)
    

    Note that this scans your partition so be aware of partition size, especially when you use timestamp in clustering column.