Tags: scala, apache-spark, cassandra, datastax-enterprise, spark-cassandra-connector

Why does a Spark application take longer to execute when reading a dataset from a Cassandra table than from a local file?


I have the following code, and the application ends immediately after generating the result.

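  import org.apache.spark.SparkContext

  def textProcess(sc: SparkContext): Unit = {
    // Read the local log file as an RDD of lines.
    val baseRDD = sc.textFile("C:\\myDrive\\test.log")
    // The reducer always keeps the accumulator, so the result is a single line of the file.
    val result = baseRDD.map { x => x }.reduce((accum, current) => accum)
    println(result)
    // The Scala SparkContext is shut down with stop().
    sc.stop()
  }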

But when I run the code below against Cassandra with spark-cassandra-connector, the application ends only after a delay of about 10 seconds.

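  import org.apache.spark.SparkContext
  import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

  def dbProcess(sc: SparkContext): Unit = {
    // Read the "configurations" table of the "local_test" keyspace as an RDD of rows.
    val baseRDD = sc.cassandraTable("local_test", "configurations")
    // The reducer always keeps the accumulator, so the result is a single "keyname" value.
    val result = baseRDD.map { x => x.getString("keyname") }.reduce((accum, current) => accum)
    println(result)
    // The Scala SparkContext is shut down with stop().
    sc.stop()
  }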

Version Details

Spark version is 1.6.x

    <dependency>
        <groupId>com.datastax.spark</groupId>
        <artifactId>spark-cassandra-connector_2.10</artifactId>
        <version>1.6.0</version>
    </dependency>

    <dependency>
        <groupId>com.datastax.cassandra</groupId>
        <artifactId>dse-driver</artifactId>
        <version>1.1.0</version>
    </dependency>

    <dependency>
        <groupId>com.datastax.cassandra</groupId>
        <artifactId>cassandra-driver-core</artifactId>
        <version>3.0.2</version>
    </dependency>

My question is: why is there a delay when using spark-cassandra-connector? Is there any way to avoid it, or is it a version problem? (I tried a few other versions, but the result was the same.)


Solution

  • why this delay when dealing with spark-cassandra-connector?

    Basically, the difference boils down to the following two lines:

    sc.textFile("C:\\myDrive\\test.log")
    

    and

    sc.cassandraTable("local_test", "configurations")
    

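    The former is a relatively cheap read of a local file, while the latter queries a remote Cassandra cluster: the connector has to open a connection, work out how the table is split into partitions, and fetch the rows over the network, which is a much heavier operation.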

    Leaving a Cassandra cluster's performance aside, network access is certainly more time-consuming than accessing a local file, isn't it?
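
To check where the extra seconds actually go, one option is to time each step separately. The following is a minimal sketch in plain Scala; `StageTimer` and `timed` are hypothetical names introduced here for illustration and are not part of Spark or the connector. The `main` method uses a stand-in workload so the sketch runs without Spark or Cassandra.

```scala
object StageTimer {
  // Hypothetical helper (not a Spark or connector API): evaluates `body`,
  // prints the elapsed wall-clock time under `label`, and returns the result.
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    try body
    finally {
      val elapsedMs = (System.nanoTime() - start) / 1000000L
      println(s"$label took $elapsedMs ms")
    }
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in the question's code this would be the
    // map/reduce over the RDD, or the call that shuts the context down.
    val total = timed("sum") { (1 to 1000).sum }
    println(total)
  }
}
```

In `dbProcess`, wrapping the `map`/`reduce` expression in one `timed` call and the context shutdown in another would show whether the delay comes from reading the table or from tearing the application down afterwards. This only measures the delay; it does not remove it.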