Tags: python-2.7, cassandra, cqlsh

How to reliably copy a Cassandra database between two servers?


I have a test setup where I would like to have a copy of master data.

I'm using Cassandra package from datastax, version 3.0.9

I'm using cqlsh to take a dump of the data and restore it on the test setup. I take a copy of the master data using

COPY TO WITH DELIMITER = '\t' AND NULL = 'null' AND QUOTE = '"' AND HEADER = True

I'm populating the data using

COPY FROM WITH DELIMITER = '\t' AND NULL = 'null' AND QUOTE = '"' AND HEADER = True

After COPY FROM, cqlsh reports that it successfully copied all the rows from the file. But when I run a count(*) on the table, a few rows are missing. There is no particular pattern to the missing rows: if I replay the command after truncating the table, a different set of rows goes missing, and the number of missing rows varies each time.
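For reference, a complete pair of commands would look like the following. The file path is illustrative, and the PAGETIMEOUT / MAXATTEMPTS settings are not from the original commands; they are standard cqlsh COPY options that are worth trying when rows are silently dropped due to server-side read timeouts:

```shell
# Export (file path and extra options are assumptions, not the original command)
cqlsh -e "COPY cypher.table1 TO '/tmp/table1.tsv'
  WITH DELIMITER = '\t' AND NULL = 'null' AND QUOTE = '\"' AND HEADER = True
  AND PAGETIMEOUT = 40 AND MAXATTEMPTS = 10;"

# Import on the test cluster
cqlsh -e "COPY cypher.table1 FROM '/tmp/table1.tsv'
  WITH DELIMITER = '\t' AND NULL = 'null' AND QUOTE = '\"' AND HEADER = True
  AND MAXATTEMPTS = 10;"
```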

The table contains lists and sets of user-defined types (UDTs), whose fields may contain null values.

Is there any other reliable way to copy data, other than programmatically reading and writing individual rows between the two databases?


Schema of the table (field names changed):

CREATE TYPE UDT1 (
   field1 text,
   field2 int,
   field3 text
);

CREATE TYPE UDT2 (
   field1 boolean,
   field2 float
);

CREATE TABLE cypher.table1 (
   id int PRIMARY KEY,
   list1 list<frozen<UDT1>>,
   data text,
   set1 set<frozen<UDT2>>
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

Solution

  • Aside from exporting/importing the data, you can copy the underlying data files (SSTables) themselves.

    1. Take a snapshot of the data from the original cluster using nodetool snapshot (https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsSnapShot.html).
    2. Create the schema on the test cluster.
    3. Load the snapshot from the original cluster onto the test cluster:

      a. If every node in the test cluster holds all the data (a single node, or e.g. a 3-node cluster with RF=3), or the amount of data is small: copy the files from the original cluster into the keyspace/column-family data directory and run nodetool refresh (https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsRefresh.html). Make sure the copied file names do not overlap with existing ones.

      b. If the test cluster nodes do not each hold all the data, or the amount of data is large: use sstableloader (https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsBulkloader.html) to stream the files from the snapshot onto the test cluster.
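The steps above can be sketched as shell commands. The snapshot tag, host names, and data-directory paths below are assumptions for illustration; the keyspace and table names come from the schema in the question:

```shell
# 1. On each source node: flush memtables, then snapshot the keyspace.
nodetool flush cypher
nodetool snapshot -t migrate cypher

# 2. On the test cluster: recreate the schema (UDTs first, then the table),
#    e.g. from a schema dump taken on the source:
#      cqlsh source-node -e "DESC KEYSPACE cypher;" > schema.cql
#      cqlsh test-node -f schema.cql

# 3a. Small data / full replica on every test node: copy the snapshot
#     SSTables into the matching data directory and pick them up in place.
#     (Table directory names carry a UUID suffix; adjust to the real path.)
cp /var/lib/cassandra/data/cypher/table1-*/snapshots/migrate/* \
   /var/lib/cassandra/data/cypher/table1-*/
nodetool refresh cypher table1

# 3b. Large data / different topology: stream the snapshot instead.
#     sstableloader infers keyspace/table from the last two path components,
#     so stage the snapshot files under <keyspace>/<table> first.
mkdir -p /tmp/load/cypher/table1
cp /var/lib/cassandra/data/cypher/table1-*/snapshots/migrate/* \
   /tmp/load/cypher/table1/
sstableloader -d test-node1,test-node2 /tmp/load/cypher/table1
```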