I want to return the first N rows from a Cassandra database filtering on some criterion, where the filtering is done on ordinary (not clustering) columns.
Let's assume a simple table like this:
CREATE TABLE test(
id UUID,
timestamp TIMESTAMP,
value DOUBLE,
PRIMARY KEY ((id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC)
Option 1
SELECT timestamp, value FROM test WHERE id=? AND value<? LIMIT ? ALLOW FILTERING
This is allowed, but ALLOW FILTERING
is generally to be avoided. Having said that, is it really that bad if the query touches only the one partition?
Option 2 Set a very small paging size, e.g. N*10 (say) and then:
SELECT timestamp, value FROM test WHERE id=?
Read the results a page at a time, and stop reading as soon as sufficient suitable rows have been read. Is there any cost associated with the pages that have not yet been fetched? If not I'd guess this is the clear winner.
Option 3 Default paging, LIMIT the number of results to N*10, issue a new query if insufficient suitable rows returned:
SELECT timestamp, value FROM test WHERE id=? AND timestamp>? LIMIT ?
If there are insufficient suitable rows in the results, issue a new query starting just after the previous query result's last timestamp
.
I'd like to know what is likely to be the best option.
I did some rough-and-ready benchmarking. To my surprise, I found that the ALLOW FILTERING
option was orders of magnitude faster, at least in my test scenario. The other two options were heavily dependent on the LIMIT
or page size, with smaller a LIMIT
/page performing very much worse.
If the first suitable row is found in the first page/first query result then the three options are not far off comparable, but ALLOW FILTERING
is still fastest.
The biggest surprise to me was that paging through results of a single large query performs little better than serial execution (i.e. non-concurrent) of multiple small queries. Could it be that each time the driver requests the next page of results, Cassandra in effect executes a new query for that page?
Clearly, these conclusions are heavily biased by the dataset being queried. However, the superiority of ALLOW FILTERING
was so stark that I'd make the working assumption that this will be applicable in almost all cases.