scala apache-spark cassandra nosql spark-cassandra-connector

Insert Spark Dataset[(String, Map[String, String])] to Cassandra Table

I have a Spark Dataset of type Dataset[(String, Map[String, String])].

I need to insert it into a Cassandra table.

The key (the String) of each tuple in the Dataset[(String, Map[String, String])] will become the primary key of the row in Cassandra.

The Map of each tuple will go into the same row, in the column ColumnNameValueMap.

The Dataset can have millions of rows.

I also want to do it in an optimal way (e.g. batch inserts, etc.).

My Cassandra table structure is:

CREATE TABLE SampleKeyspace.CassandraTable (
  RowKey text PRIMARY KEY,
  ColumnNameValueMap map<text,text>
);

Please suggest how to do this.

Solution

All you need is the Spark Cassandra Connector (preferably version 2.5.0, which was just released). It provides read and write functions for Datasets, so in your case it would be just:

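import org.apache.spark.sql.cassandra._
// Tuple columns are named _1 and _2, so rename them to match the table columns;
// unquoted CQL identifiers are stored in lowercase, hence the lowercase names.
your_data.toDF("rowkey", "columnnamevaluemap").write.cassandraFormat("cassandratable", "samplekeyspace").mode("append").save()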
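
For a fuller picture, here is a minimal end-to-end sketch. It assumes the connector is on the classpath, the job is launched through spark-submit, and a Cassandra node is reachable at 127.0.0.1; the object name and the sample rows are made up for illustration. It renames the tuple columns (_1, _2) to the lowercase names that Cassandra stores for unquoted identifiers, since the connector matches DataFrame columns to table columns by name.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.cassandra._

object WriteMapToCassandra {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-map-to-cassandra")
      // Assumption: a Cassandra node is reachable on localhost.
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()
    import spark.implicits._

    // Placeholder data standing in for the real Dataset.
    val yourData: Dataset[(String, Map[String, String])] = Seq(
      ("row1", Map("colA" -> "1", "colB" -> "2")),
      ("row2", Map("colC" -> "3"))
    ).toDS()

    yourData
      // Rename _1/_2 so the columns line up with the table definition.
      .toDF("rowkey", "columnnamevaluemap")
      .write
      // Lowercase names: the table was created with unquoted identifiers.
      .cassandraFormat("cassandratable", "samplekeyspace")
      .mode("append")
      .save()

    spark.stop()
  }
}
```

As for batching, the connector already groups and parallelizes writes on its own, so no manual batch logic should be needed. If tuning turns out to be necessary, settings such as spark.cassandra.output.batch.size.rows and spark.cassandra.output.concurrent.writes are the usual starting points; check the connector's configuration reference for the defaults in your version.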

If your table doesn't exist yet, you can create it based on the structure of the dataset itself. There are two functions, createCassandraTable and createCassandraTableEx; it's better to use the second, as it provides more control over table creation.
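
A rough sketch of the table-creation call follows. The argument order reflects the createCassandraTableEx signature as I recall it from the 2.5.0 release (keyspace, table, partition key columns, clustering columns, then ifNotExists), so verify it against the version you actually use. The object and method names are hypothetical, and the DataFrame is assumed to already have its columns renamed to rowkey and columnnamevaluemap.

```scala
import com.datastax.spark.connector._
import org.apache.spark.sql.DataFrame

object CreateTableFromDataset {
  // Creates samplekeyspace.cassandratable from the DataFrame schema,
  // using rowkey as the partition key and no clustering columns.
  def createIfMissing(df: DataFrame): Unit =
    df.createCassandraTableEx(
      "samplekeyspace",
      "cassandratable",
      Seq("rowkey"),      // partition key columns
      Seq.empty,          // clustering columns: none for this table
      ifNotExists = true  // do not fail if the table already exists
    )
}
```

Calling CreateTableFromDataset.createIfMissing on the renamed DataFrame before the write would then let the same job run against an empty keyspace. The keyspace itself still has to exist beforehand.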

P.S. You can find more about the 2.5.0 release in the following blog post.