I use the DataStax Spark Cassandra Connector to fill a Cassandra cluster and to process the data in separate jobs (some operations, such as double aggregation, are not supported by Spark for stream processing), so I want different jobs to store data in the same table. Assume a first streaming job inserts a row into this table (using a foreach writer, because the connector doesn't support streamed writing yet):
INSERT INTO keyspace_name.table_name (id, col1, col2) VALUES ('test', 1, null);
What happens if I then append (upsert) a dataset containing a null column where Cassandra already holds a non-null value for that row?
// One row of the dataset: ("test", null, 2)
import org.apache.spark.sql.SaveMode

dataset.write
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", keyspace)
  .option("table", table)
  .mode(SaveMode.Append)
  .save()
If I understand the docs correctly, the previous non-null value will be overwritten by the new null value. Is that right? If so, is there a way to keep existing non-null values, or do I have to store the data in a separate table for each job?
Yes, non-null values will be overwritten by null. To avoid this behavior, set spark.cassandra.output.ignoreNulls = true. This causes all null values to be left as unset rather than bound.
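A minimal sketch of where the flag can go, assuming Spark Cassandra Connector 2.0+ (ignoreNulls requires Cassandra 2.2+, where missing values can be bound as unset) and reusing the dataset, keyspace, and table placeholders from the question:

import org.apache.spark.sql.SaveMode

// Per-write: pass the parameter as a writer option so only this job ignores nulls
// (setting connector parameters per dataset depends on the connector version).
dataset.write
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", keyspace)
  .option("table", table)
  .option("spark.cassandra.output.ignoreNulls", "true") // nulls left unset; existing values survive
  .mode(SaveMode.Append)
  .save()

// Globally: set it on the SparkConf so every Cassandra write in the application ignores nulls.
// val conf = new SparkConf().set("spark.cassandra.output.ignoreNulls", "true")

// RDD API alternative: pass a WriteConf to saveToCassandra.
// import com.datastax.spark.connector._
// import com.datastax.spark.connector.writer.WriteConf
// rdd.saveToCassandra(keyspace, table, writeConf = WriteConf(ignoreNulls = true))

With this in place, the row from the question keeps col1 = 1 after the second write instead of having it nulled out.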
See the Write Tuning Parameters section of the connector reference documentation.