I have existing data table in cassandra date table with primary key is id
SELECT * FROM Op_History ORDER BY create_time DESC limit 100;
I tried this one but getting error as :
InvalidRequest: Error from server: code=2200 [Invalid query] message="ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
primary key is id
So with Cassandra, you need to design your tables to support a specific query. With the PRIMARY KEY being id
, really the only query it will support is each individual row by the id
.
I'd recommend building a query table for that data like this:
CREATE TABLE op_history (
id UUID,
create_time TIMESTAMP,
day_bucket INT,
op_data TEXT,
PRIMARY KEY ((day_bucket),create_time,id))
WITH CLUSTERING ORDER BY (create_time DESC, id ASC);
By partitioning on day_bucket
, I'm ensuring that all data for a specific day is stored together. I'm not sure about the business case behind op_history
, but if you need to query for an entire month's worth of data, then you would use something like month_bucket
instead.
Now, I can filter on rows for a specific day:
> SELECT * FROM op_history WHERE day_bucket=20221221;
day_bucket | create_time | id | op_data
------------+---------------------------------+--------------------------------------+---------
20221221 | 2022-12-21 14:42:58.552000+0000 | 59b0a30b-213b-4847-bd3e-134a641be21f | Hello4!
20221221 | 2022-12-21 14:42:56.057000+0000 | 7148d5b3-77d7-4088-8c6d-f2e4c73175f2 | Hello3!
20221221 | 2022-12-21 14:42:53.866000+0000 | b23f4556-2a72-4014-a6e9-7a2ceb55217c | Hello2!
20221221 | 2022-12-21 14:42:47.738000+0000 | 51d09afa-806e-4bec-b6bf-94eb1a67910d | Hello!
(4 rows)
With the CLUSTERING ORDER
defined, I won't need an ORDER BY
clause.
As I dont have chance to change table creation
Oh, I'm not suggesting that. I'm suggesting that you create a new table with a different primary key definition, and load the same data into it. That's actually the best practice for data modeling in Cassandra.
is there any possibilities i.e.. ALLOW FILTERING
So using the ALLOW FILTERING
directive is generally considered to be "bad practice," because it consumes too many resources. If the query has to talk to too many nodes, it could likely time out or even crash the coordinator node. Also, ALLOW FILTERING
still won't let an ORDER BY
to be applied to it.
One thing a lot of teams end up doing is building a Spark cluster to work with Cassandra data. Spark can pull data from Cassandra and work with it in RAM to perform ANSI-compliant SQL operations on it. That would allow you to apply an ORDER BY
.
On the other hand, you could try ALLOW FILTERING
and then perform the sort on the application side. Definitely not ideal.