apache-spark cassandra spark-cassandra-connector

Perfomance related question on spark + cassandra (JAVA code)

i am using cassandra as my dumping ground on which i have multiple jobs running to process the data and update different system. below are the job related filters

Job 1. data filter based on active_flag and update_date_time and expiry_time and process the filtered data.

Job 2. data filter based on update_date_time process the data

Job 3. data filter based on created_date_time and active flag

db columns on which where condition would run are (one or many columns in one query)

Active -> yes/no
created_date -> timestamp
expiry_time -> timestamp
updated_date -> timestamp

My question on these conditions :-

how should i form my cassandra primary key? as i dont see any way to acheive uniqueness on this (id is present but thats not required for me to process data).
do i even need the primary key if i use the filtering on spark code using table scan?

considering this for millions of records processing.

Solution

Answering to your question - you need to have a primary key, even if it consists only of the partition key :-)

More detailed answer really depends on how often these jobs are running, how much data overall, how many nodes in the cluster, what hardware is used, etc. Usually, we're trying to push as much filtering to Cassandra as possible, so it will return only relevant data, not everything. The most effective this filtering happens on the first clustering column, for example, if I want to process only newly created entries, then I can use the table with following structure:

create table test.test (
  pk int,
  tm timestamp,
  c2 int,
  v1 int,
  v2 int,
  primary key(pk, tm, c2));

and then I can fetch only newly created entries by using:

import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("test", "test").load()
val filtered = data.filter("tm >= cast('2019-03-10T14:41:34.373+0000' as timestamp)")

Or I can fetch entries in the given time period:

val filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
  AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")

The effect of the filter pushdown could be checked by executing explain on the dataframe, and checking the PushedFilters section - conditions that are marked with * will be executed on Cassandra side...

But it's not always possible to design tables to match all queries, so you'll need to design primary key for jobs that are executed most often. In your case, update_date_time could be a good candidate for that, but if you put it as clustering column, then you'll need to take care when updating it - you'll need to perform change as batch, something like this:

begin batch
delete from table where pk = ... and update_date_time = old_timestamp;
insert into table (pk, update_date_time, ...) values (..., new_timestamp, ...);
apply batch;

or something like this.