scala apache-spark cassandra nosql spark-cassandra-connector

Insert Spark Dataset[(String, Map[String, String])] to Cassandra Table


I have a Spark Dataset of type Dataset[(String, Map[String, String])].

I need to insert it into a Cassandra table.

The key (the String) of each element in the Dataset[(String, Map[String, String])] will become the primary key of the row in Cassandra.

The Map of each element will go into the same row, in the column ColumnNameValueMap.

The Dataset can have millions of rows.

I also want to do this efficiently (e.g. with batch inserts).

My Cassandra table structure is:

CREATE TABLE SampleKeyspace.CassandraTable (
  RowKey text PRIMARY KEY,
  ColumnNameValueMap map<text,text>
);

Please suggest how to do this.


Solution

  • All you need is the Spark Cassandra Connector (preferably version 2.5.0, which was just released). It provides read and write functions for datasets, so in your case it will be just:

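    import org.apache.spark.sql.cassandra._
    // Tuple columns are named _1 and _2, so rename them to the table's columns.
    // Unquoted CQL identifiers are stored in lower case, hence the lower-case names.
    your_data.toDF("rowkey", "columnnamevaluemap")
      .write.cassandraFormat("cassandratable", "samplekeyspace").mode("append").save()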
    

    If your table doesn't exist yet, you can create it based on the structure of the dataset itself. There are two functions, createCassandraTable and createCassandraTableEx; it's better to use the second, as it provides more control over table creation.

    P.S. You can find more about the 2.5.0 release in the blog post announcing it.
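
To tie the pieces of the answer together, here is a minimal end-to-end sketch in Scala. It rests on several assumptions that are not in the original post: a Cassandra node reachable on 127.0.0.1, an existing keyspace named samplekeyspace, illustrative tuning values, and made-up sample rows standing in for the real Dataset. The argument list of createCassandraTableEx is written as I understand it from the connector 2.5 documentation, so check it against the version you actually use.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.cassandra._ // cassandraFormat on readers and writers
import com.datastax.spark.connector._   // createCassandraTable(Ex) on DataFrames

object WriteMapToCassandra {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-map-to-cassandra")
      // Assumption: a Cassandra node is reachable on localhost.
      .config("spark.cassandra.connection.host", "127.0.0.1")
      // Optional write tuning; the values are illustrative, not recommendations.
      .config("spark.cassandra.output.concurrent.writes", "8")
      .config("spark.cassandra.output.batch.size.bytes", "2048")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for the real Dataset.
    val data: Dataset[(String, Map[String, String])] = Seq(
      ("row-1", Map("colA" -> "1", "colB" -> "2")),
      ("row-2", Map("colC" -> "3"))
    ).toDS()

    // A tuple Dataset has columns _1 and _2; rename them to the table's
    // column names (lower case, because the CQL identifiers were unquoted).
    val df = data.toDF("rowkey", "columnnamevaluemap")

    // Optional: create the table from the DataFrame schema if it is missing.
    // Assumption: the keyspace samplekeyspace already exists.
    df.createCassandraTableEx(
      "samplekeyspace",
      "cassandratable",
      Seq("rowkey"),        // partition key columns
      ifNotExists = true)

    // Append the rows; Spark's map<string,string> maps to map<text,text>.
    df.write
      .cassandraFormat("cassandratable", "samplekeyspace")
      .mode("append")
      .save()

    spark.stop()
  }
}
```

On the "batch insert" part of the question: by default the connector groups rows into batches by partition key. Here RowKey is the whole primary key, so every row lives in its own partition and batches will mostly hold a single row. Throughput then comes mainly from the concurrent asynchronous writes, so spark.cassandra.output.concurrent.writes is likely the more useful knob to experiment with.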