Tags: scala, apache-spark, cassandra, insert-update, spark-cassandra-connector

Spark Cassandra append dataset to table with null values


I use the DataStax Spark Cassandra Connector to fill a Cassandra cluster and to process the data in separate jobs (because Spark doesn't support some operations on streams, such as double aggregation). So I want to store the data for the different jobs in the same table. Suppose a first streaming job inserts a row into this table (using a foreach writer, because the connector doesn't support streamed writing yet):

INSERT INTO keyspace_name.table_name (id, col1, col2) VALUES ('test', 1, null);
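
For context, this is roughly what such a writer looks like (a hedged sketch: the Record case class, the class name, and the raw CQL are illustrative, and the exact session type depends on the connector/driver version):

import org.apache.spark.sql.ForeachWriter
import com.datastax.spark.connector.cql.CassandraConnector

// Illustrative row type matching the table above.
case class Record(id: String, col1: java.lang.Integer, col2: java.lang.Integer)

class CassandraForeachWriter(connector: CassandraConnector) extends ForeachWriter[Record] {
  def open(partitionId: Long, version: Long): Boolean = true

  // Upsert one row per streamed record.
  def process(r: Record): Unit =
    connector.withSessionDo { session =>
      session.execute(
        "INSERT INTO keyspace_name.table_name (id, col1, col2) VALUES (?, ?, ?)",
        r.id, r.col1, r.col2)
    }

  def close(errorOrNull: Throwable): Unit = ()
}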

What happens if I then append (upsert) a dataset containing a null for a column where Cassandra already holds a non-null value for that row?

// One row of the dataset: ("test", null, 2)
dataset.write
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", keyspace)
  .option("table", table)
  .mode(SaveMode.Append)
  .save()

If I understand the docs correctly, the previous non-null value will be overwritten by the new null value. Is that right? If so, is there a way to keep the existing non-null values, or do I have to store the data for each job in a separate table?


Solution

  • Yes, non-null values will be overwritten by null.

    To avoid this behavior, set spark.cassandra.output.ignoreNulls = true. This causes all null values to be left as unset rather than bound, so the existing values in Cassandra are kept (and, since no null is bound, no tombstones are written). See Write Tuning Parameters in the connector documentation.
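
    For example, the parameter can be supplied as a per-write option (a minimal sketch, reusing the dataset, keyspace, and table placeholders from the question; alternatively, set it application-wide, e.g. --conf spark.cassandra.output.ignoreNulls=true on spark-submit):

    import org.apache.spark.sql.SaveMode

    // Null columns in the dataset are left unset instead of being bound,
    // so existing non-null values for those columns are preserved.
    dataset.write
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", keyspace)
      .option("table", table)
      .option("spark.cassandra.output.ignoreNulls", "true")
      .mode(SaveMode.Append)
      .save()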