cassandra apache-spark spark-streaming spark-cassandra-connector

Write times in Cassandra using the Spark Cassandra Connector

I have a use case where a Spark Streaming app must continuously listen to a Kafka topic and write to 2000 column families (15 columns each, time series data), choosing the column family based on a column value. I have a local Cassandra installation. Creating these column families takes around 1.5 hours on a CentOS VM with 3 cores and 12 GB of RAM. The streaming app does some preprocessing before storing the stream events in Cassandra, and I'm running into issues with how long it takes to complete.
I tried saving 300 events to multiple column families (roughly 200-250), selected by key, and the app takes around 10 minutes to save them. This seems strange, because printing the same events to the screen grouped by key takes less than a minute; it is only slow when saving to Cassandra. I have had no issues saving on the order of 3 million records to Cassandra; that took less than 3 minutes (but it was to a single column family).

My requirement is to be as close to real time as possible, and this is nowhere near that. The production environment would have roughly 400 events every 3 seconds.

Is there any tuning I need to do in Cassandra's YAML configuration file, or any changes I need to make to the Cassandra connector itself?

INFO  05:25:14 system_traces.events                      0,0
WARN  05:25:14 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:14 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:15 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:15 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:15 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:15 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO  05:25:16 ParNew GC in 340ms.  CMS Old Gen: 1308020680 -> 1454559048; Par Eden Space: 251658240 -> 0; 
WARN  05:25:16 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:16 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:17 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:17 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:17 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:17 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO  05:25:17 ParNew GC in 370ms.  CMS Old Gen: 1498825040 -> 1669094840; Par Eden Space: 251658240 -> 0; 
WARN  05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:18 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:18 Read 2124 live and 4248 tombstoned cells in system.schema_columnfamilies (see tombstone_warn_threshold). 2147483639 columns was requested, slices=[-]
WARN  05:25:19 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
WARN  05:25:19 Read 33972 live and 70068 tombstoned cells in system.schema_columns (see tombstone_warn_threshold). 2147483575 columns was requested, slices=[-]
INFO  05:25:19 ParNew GC in 382ms.  CMS Old Gen: 1714792864 -> 1875460032; Par Eden Space: 251658240 -> 0; 

Solution

I suspect you're hitting edge cases in Cassandra related to the large number of CFs/columns defined in the schema. Typically, tombstone warnings mean the data model is at fault. However, these are in system tables, so you've evidently done something to the tables that the authors didn't expect (lots and lots of tables, and probably dropping/recreating them a lot).

Those warnings were added because scanning past tombstones looking for live columns causes memory pressure, which causes GC, which causes pauses, which causes slowness.
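
To put numbers on that chain, here is a rough Python sketch that pulls the tombstone counts and ParNew pause times out of log lines shaped like the ones in the question. The regular expressions and the `summarize` helper are my own illustration and assume exactly that line format; they are not part of any Cassandra tooling.

```python
import re

# Patterns written against the log excerpt in the question (assumed format).
TOMBSTONE_RE = re.compile(r"Read (\d+) live and (\d+) tombstoned cells in (\S+)")
GC_RE = re.compile(r"ParNew GC in (\d+)ms")


def summarize(log_lines):
    """Return ({table: (live, tombstoned)}, [ParNew pause in ms, ...])."""
    tombstones = {}
    gc_ms = []
    for line in log_lines:
        match = TOMBSTONE_RE.search(line)
        if match:
            # Keep the most recent counts seen for each table.
            tombstones[match.group(3)] = (int(match.group(1)), int(match.group(2)))
            continue
        match = GC_RE.search(line)
        if match:
            gc_ms.append(int(match.group(1)))
    return tombstones, gc_ms


# Three lines copied from the question's log.
sample = [
    "WARN  05:25:14 Read 2124 live and 4248 tombstoned cells in "
    "system.schema_columnfamilies (see tombstone_warn_threshold).",
    "WARN  05:25:14 Read 33972 live and 70068 tombstoned cells in "
    "system.schema_columns (see tombstone_warn_threshold).",
    "INFO  05:25:16 ParNew GC in 340ms.  CMS Old Gen: 1308020680 -> 1454559048;",
]

tombstones, gc_ms = summarize(sample)
for table, (live, dead) in tombstones.items():
    print(f"{table}: {dead / live:.1f} tombstoned cells per live cell")
print(f"ParNew pauses (ms): {gc_ms}")
```

In your log, both schema tables show roughly two tombstoned cells for every live one, and the same scans repeat several times per second alongside ParNew pauses of 340-382 ms.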

Can you squash the data into significantly fewer column families? You may also want to try clearing out the tombstones: drop gc_grace_seconds (gcgs) for that table to zero, run a major compaction on the system keyspace (if that's allowed), then raise it back to the default.
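
As a minimal sketch of what "fewer column families" could look like: use one table and make the value that currently selects the column family the partition key. The keyspace, table and column names (`ts.events`, `series_key`, `event_time`, `payload`) and the event field names (`key`, `ts`, `payload`) are hypothetical, since the question doesn't show the schema. The code only builds the CQL text and the bind values; it doesn't connect to Cassandra.

```python
from collections import defaultdict

# Hypothetical consolidated table: the value that used to pick one of the
# ~2000 column families becomes the partition key of a single table.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS ts.events (
    series_key text,
    event_time timestamp,
    payload    text,
    PRIMARY KEY (series_key, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
"""

INSERT_CQL = (
    "INSERT INTO ts.events (series_key, event_time, payload) VALUES (?, ?, ?)"
)


def to_row(event):
    """Map an incoming event (dict with assumed field names) to bind values."""
    return (event["key"], event["ts"], event["payload"])


def group_by_partition(events):
    """Group rows by partition key so rows for one series are written together."""
    groups = defaultdict(list)
    for event in events:
        row = to_row(event)
        groups[row[0]].append(row)
    return dict(groups)
```

With a layout like this, the streaming job writes every micro-batch to one table (for example with the connector's `saveToCassandra` on that single keyspace/table pair), and the schema tables stay small because no per-key tables are created or dropped.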