I have a table like this in Cassandra (2.1.15.1423) with over 14,000,000 records:
CREATE TABLE keyspace.table (
field1 text,
field2 text,
field3 text,
field4 uuid,
field5 map<text, text>,
field6 list<text>,
field7 text,
field8 list<text>,
field9 list<text>,
field10 text,
field11 list<text>,
field12 text,
field13 text,
field14 text,
field15 list<frozen<user_defined_type>>,
field16 text,
field17 text,
field18 text,
field19 text,
PRIMARY KEY ((field1, field2, field3), field4)
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
In the application I use Python (cassandra-driver==3.1.1) and Go (gocql).
Problem:
I need to move records from this table to another. When I try to get the data (even without filters), everything stops and I get a timeout error. I tried changing fetch_size/page_size; the result is the same, just after a few minutes of waiting.
If you are going to move records from this table to another table, you should do it one token range at a time. Something like

SELECT * FROM keyspace.table

will not work in a highly distributed datastore such as Cassandra. A query like the one above requires a full cluster scan and a scatter/gather operation to satisfy it; this is an anti-pattern in C* and will cause timeouts in most cases. A better approach is to query only one partition (or a small token range) at a time, which the datastore can serve very quickly. A common pattern for this sort of operation is to iterate through the token ranges of the table one at a time and process each range individually, so that each query only touches a small slice of the data.