cassandra, hector, pelops

Counting columns, very slow CountQuery vs SliceQuery operations


I've written a "census" program to iterate through all the rows in a Column Family and within each row count the columns, recording the max value and row key. I've been spending more time with the Hector client but have written a Pelops client as well to test.

The basic flow is to use a RangeSlicesQuery to iterate through the rows and then, at each row, use a SliceQuery to iterate through the columns and collect the stats. It works similarly in Pelops, just with different APIs. The downside is having to do the buffering manually and pick buffer sizes for both rows and columns. My current data is 12 million rows, with the largest column count around 25K, so it takes a while. In my current configuration I am getting >25K rows per second.
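The manual buffering described above can be sketched roughly as follows. This is not Hector or Pelops code: a sorted in-memory map stands in for the column family, and `CensusSketch`, `page`, `countColumns`, and `census` are hypothetical names chosen for illustration. It assumes each page resumes from the last key already seen (inclusive), so the first entry of every page after the first is a repeat and is skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class CensusSketch {
    /** Census result: the widest row, its column count, and rows visited. */
    public static final class Result {
        public final String maxRowKey;
        public final int maxColumns;
        public final long rowsSeen;

        Result(String maxRowKey, int maxColumns, long rowsSeen) {
            this.maxRowKey = maxRowKey;
            this.maxColumns = maxColumns;
            this.rowsSeen = rowsSeen;
        }
    }

    // Stand-in for one slice request: up to `limit` keys, starting at
    // `start` inclusive (null means "from the beginning").
    static <V> List<String> page(NavigableMap<String, V> sorted, String start, int limit) {
        NavigableMap<String, V> view = (start == null) ? sorted : sorted.tailMap(start, true);
        List<String> keys = new ArrayList<>();
        for (String k : view.keySet()) {
            if (keys.size() == limit) break;
            keys.add(k);
        }
        return keys;
    }

    // Count the columns of one row by paging through the column names.
    static int countColumns(NavigableMap<String, String> row, int colPage) {
        int count = 0;
        String start = null;
        while (true) {
            List<String> names = page(row, start, colPage);
            // After the first page, entry 0 repeats the previous page's last name.
            count += (start == null) ? names.size() : names.size() - 1;
            if (names.size() < colPage) return count; // short page: end of row
            start = names.get(names.size() - 1);
        }
    }

    // Page through all rows, count each row's columns, and track the maximum.
    public static Result census(NavigableMap<String, NavigableMap<String, String>> cf,
                                int rowPage, int colPage) {
        if (rowPage < 2 || colPage < 2) {
            // With inclusive resume, a page size of 1 would never advance.
            throw new IllegalArgumentException("page sizes must be >= 2");
        }
        String maxKey = null;
        int max = -1;
        long rows = 0;
        String start = null;
        while (true) {
            List<String> keys = page(cf, start, rowPage);
            for (int i = (start == null) ? 0 : 1; i < keys.size(); i++) {
                String key = keys.get(i);
                int n = countColumns(cf.get(key), colPage);
                rows++;
                if (n > max) {
                    max = n;
                    maxKey = key;
                }
            }
            if (keys.size() < rowPage) {
                return new Result(maxKey, Math.max(max, 0), rows);
            }
            start = keys.get(keys.size() - 1);
        }
    }
}
```

In a real client each call to `page` would be a network round trip, which is why the two page sizes matter: larger pages mean fewer round trips but more memory per response.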

Looking for ways to improve this, I discovered Hector's CountQuery (which I assume uses the Thrift client's get_count()). Thinking it would be faster to iterate over keys only (using RangeSlicesQuery.setReturnKeysOnly()) and then reuse a CountQuery on each row key, I revised the code.

It was not just slower, it was about 30x slower: it processed only about 900 rows per second.

Is there a better way to count columns?


Solution

  • Not sure what's going on with Hector -- I'd expect it to be roughly 2x slower, not 30x slower.

    More generally, keeping a denormalized count using a counter column is probably better than a full CF scan: http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
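As a rough illustration of the denormalized-count idea, the sketch below keeps a per-row counter that is updated on every write, so the column count can be read without scanning the row. It is an in-memory stand-in, not Cassandra or Hector code; `ColumnCountSketch`, `insert`, `columnCount`, and `widestRow` are hypothetical names. One assumption to flag: this sketch only increments when the column is new, which is easy in memory, whereas a blind counter increment in Cassandra would not detect an overwrite of an existing column, so the write path would have to guarantee that each write adds a new column.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class ColumnCountSketch {
    // Stand-in for the data column family: row key -> (column name -> value).
    private final Map<String, Map<String, String>> rows = new ConcurrentHashMap<>();
    // Stand-in for a counter column family: row key -> number of columns.
    private final Map<String, AtomicLong> counts = new ConcurrentHashMap<>();

    /** Write a column and bump the row's counter if the column is new. */
    public void insert(String rowKey, String column, String value) {
        String previous = rows
                .computeIfAbsent(rowKey, k -> new ConcurrentHashMap<>())
                .put(column, value);
        if (previous == null) {
            counts.computeIfAbsent(rowKey, k -> new AtomicLong()).incrementAndGet();
        }
    }

    /** Read the maintained count; no scan of the row's columns is needed. */
    public long columnCount(String rowKey) {
        AtomicLong c = counts.get(rowKey);
        return c == null ? 0L : c.get();
    }

    /** Find the widest row by scanning counters only, or null if empty. */
    public String widestRow() {
        String best = null;
        long max = -1;
        for (Map.Entry<String, AtomicLong> e : counts.entrySet()) {
            long n = e.getValue().get();
            if (n > max) {
                max = n;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

The census then reduces to reading one small counter per row instead of paging through up to ~25K columns per row.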