I need to get a count from a very large dataset in Cassandra, 100 million plus. I am worried about the memory hit cassandra would take if I just ran the following query.
select count(*) from conv_org where org_id = 'TEST_ORG'
I was told I could use cassandra Automatic Paging to do this? Does this seem like a good option?
Would the syntax look something like this?
Statement stmt = new SimpleStatement("select count(*) from conv_org where org_id = 'TEST_ORG'");
stmt.setFetchSize(1000);
ResultSet rs = session.execute(stmt);
I am unsure the above code will work as I do not need a result set back I just need a count.
Here is the data model.
CREATE TABLE ts.conv_org (
org_id text,
create_time timestamp,
test_id text,
org_type int,
PRIMARY KEY (org_id, create_time, conv_id)
)
If org_id
isn't your primary key counting in cassandra in general is not a fast operation and can easily lead to a full scan of all sstables in your cluster and therefore be painfully slow.
In Java for example you can do something like this:
ResultSet rs = session.execute(...);
Iterator<Row> iter = rs.iterator();
while (iter.hasNext()) {
if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
rs.fetchMoreResults();
Row row = iter.next()
... process the row ...
}
https://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/ResultSet.html
You could select a small colum and count your self. There is int getAvailableWithoutFetching()
and isFullyFetched()
that could help you.
In general if you really need a count - maintain it yourself.
On the other hand, if you have really many rows in one partition you can have also some other performance problems.
But that's hard to say without knowing the data model.