Search code examples
cassandranosql

Cassandra poor performance when aggregating over 40 columns in a 500K partition


I have a cassandra table simillar to the below code

CREATE TABLE IF NOT EXISTS timebased_dev.sample_table (
    operation_id uuid,
    job_id uuid,
    timestamp timestamp,
    depth double,
    depth_is_null int,
    c0 double,
    c0_is_null int,
    c1 double,
    c1_is_null int,
    c2 double,
    c2_is_null int,
    ...
    c499 double,
    c499_is_null int,
    PRIMARY KEY ((operation_id, job_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

I would like to group by the timestamp using floor. I am expecting data every 500ms and I will have around 500k rows every three days. I would like to group by 5 minute intervals and do average for each column. Currently it takes around 100 seconds to do that. What am I doing wrong?

I tried using the built in function AVG and SUM. I need the average for some columns and the sum for others. I am not doing allow filtering. I have a single partition for the 500k rows. Should I use multiple paritions? Will that improve my query?

Here is a sample query I intend to run

SELECT floor(timestamp, 5m),
  AVG(c0),
  SUM(c0_is_null),
  AVG(c1),
  SUM(c1_is_null),
  AVG(c2),
  SUM(c2_is_null),
  AVG(c3),
  SUM(c3_is_null),
  AVG(c4),
  SUM(c4_is_null),
  AVG(c5),
  SUM(c5_is_null),
  AVG(c6),
  SUM(c6_is_null),
  AVG(c7),
  SUM(c7_is_null),
  AVG(c8),
  SUM(c8_is_null),
  AVG(c9),
  SUM(c9_is_null),
  AVG(c10),
  SUM(c10_is_null),
  AVG(c11),
  SUM(c11_is_null),
  AVG(c12),
  SUM(c12_is_null),
  AVG(c13),
  SUM(c13_is_null),
  AVG(c14),
  SUM(c14_is_null),
  AVG(c15),
  SUM(c15_is_null),
  AVG(c16),
  SUM(c16_is_null),
  AVG(c17),
  SUM(c17_is_null),
  AVG(c18),
  SUM(c18_is_null),
  AVG(c19),
  SUM(c19_is_null)
FROM timebased_dev.sample_table
WHERE operation_id = e2c5f52e-e3c1-40f8-aa96-ed8af74237b0
  AND job_id = 47913ea2-8c7b-44d3-a81f-46fbcd0c4bc9
  AND timestamp > '2024-01-15 22:34:51.9050+07:00'
  AND timestamp < '2024-01-18 22:34:51.9050+07:00'
GROUP BY floor(timestamp, 5m);

Solution

  • The behaviour you are seeing is expected. Your query is not OLTP but analytics in nature and very, very expensive considering the following factors:

    • it is aggregating over 40 columns
    • of a partition with over 500 columns
    • and 500K rows.

    You should consider using Spark for analytics queries since it will be more efficient. Regardless, it will still be costly to run your query given the dataset size. Cheers!