Tags: scala, apache-spark, cassandra, sbt, spark-cassandra-connector

How does DataStax Spark Cassandra connector create SparkContext?


I have run the following Spark test program successfully. In it I use the "cassandraTable" method and the "getOrCreate" method on SparkContext, but I cannot find either of them in the Spark Scala API docs for that class. What am I missing in understanding this code? I am trying to understand whether this SparkContext is somehow different when the DataStax connector is on the sbt classpath.

Code -

import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._

object CassandraInt {

  def main(args: Array[String]): Unit = {

    val SparkMasterHost = "127.0.0.1"
    val CassandraHost = "127.0.0.1"

    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", CassandraHost)
      .set("spark.cleaner.ttl", "3600")
      .setMaster("local[12]")
      .setAppName(getClass.getSimpleName)

    // Connect to the Spark cluster:
    lazy val sc = SparkContext.getOrCreate(conf)

    // cassandraTable is available on sc because of the connector import above
    val rdd = sc.cassandraTable("test", "kv")
    println(rdd.count)
    println(rdd.map(_.getInt("value")).sum)
  }
}

The build.sbt file I used is -

name := "Test Project"
version := "1.0"
scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"

addCommandAlias("c1", "run-main CassandraInt")

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-M3"

fork in run := true

Solution

  • It is no different. Spark supports only one active SparkContext per JVM, and getOrCreate is defined on the SparkContext companion object rather than on the class itself, which is why it does not show up among the instance methods in the API docs:

    This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext.

    There is also an overload of getOrCreate that takes no SparkConf, which is useful when you only want to retrieve an already-created context.

    To summarize (a short sketch follows the list):

    • If there is an active context it returns it.
    • Otherwise it creates a new one.
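    A minimal sketch of this behavior (the master and app name here are just placeholders):

        import org.apache.spark.{SparkConf, SparkContext}

        val conf = new SparkConf(true)
          .setMaster("local[2]")
          .setAppName("GetOrCreateDemo")

        val sc1 = SparkContext.getOrCreate(conf) // no active context yet: creates one
        val sc2 = SparkContext.getOrCreate()     // one is active now: returns it
        assert(sc1 eq sc2)                       // both refer to the same singleton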

    cassandraTable is not defined on SparkContext at all: it is a method of the connector's SparkContextFunctions class, exposed on SparkContext through an implicit conversion that comes into scope with import com.datastax.spark.connector._.
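
    A simplified sketch of that enrichment pattern (the names MySparkContextFunctions and myTable are illustrative, not the actual connector source):

        import scala.language.implicitConversions
        import org.apache.spark.SparkContext

        // Stand-in for the connector's SparkContextFunctions wrapper.
        class MySparkContextFunctions(sc: SparkContext) {
          def myTable(keyspace: String, table: String): String =
            s"would build an RDD over $keyspace.$table"
        }

        object MyImplicits {
          // Once this conversion is in scope, myTable can be called directly
          // on any SparkContext, just as importing com.datastax.spark.connector._
          // makes sc.cassandraTable(...) available.
          implicit def toMyFunctions(sc: SparkContext): MySparkContextFunctions =
            new MySparkContextFunctions(sc)
        }

        // Usage:
        //   import MyImplicits._
        //   sc.myTable("test", "kv")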