scala, apache-spark, cassandra, spark-cassandra-connector

RDD of CassandraRow not working with take-command - why?


I am doing some exercises on the DataStax VM.

A Cassandra table is given, and I am supposed to do some filtering and retrieve the top 5 elements using Spark API functions rather than Cassandra query functions.

There I am doing the following:

val cassRdd = sc.cassandraTable("killr_video", "videos_by_year_title")
val cassRdd2 = cassRdd.filter(r => r.getString("title") >= "T")
println("1: " + cassRdd2)
println("2: " + cassRdd2.count)
println("3: " + cassRdd2.take(5))
println("4: " + cassRdd2.take(5).count)

Results in:

  • 1: MapPartitionsRDD[185] at filter at <console>:19
  • 2: 2250
  • 3: [Lcom.datastax.spark.connector.CassandraRow;@56fd2e09
  • 4: compile error (missing arguments for method count in trait TraversableOnce)

What I have expected:

  • 1: and 2: work as expected
  • 3: returns only one row? I would expect an RDD of 5 CassandraRows
  • 4: this isn't the RDD count from 3:, so I didn't expect it to work; it looks like some kind of CassandraRow count method I was not supposed to call

The solution given by DataStax maps the RDD to just the title, and on that new title RDD it does the filtering and the take command.

OK, that works, but I don't understand why take does not work on an RDD of CassandraRow, or what its result actually is.

val cassRdd2 = cassRdd.map(r => r.getString("title")).filter(t => t >= "T")
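
For completeness, the full chain would then presumably end with take; a minimal sketch of that last step (my reconstruction, building on the cassRdd from above, not the exact DataStax code):

val titleRdd = cassRdd.map(r => r.getString("title")).filter(t => t >= "T")
// take(5) returns an Array[String] on the driver, not an RDD
titleRdd.take(5).foreach(println)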

I thought take on any RDD (regardless of its contents) would always do the same thing: take the first x elements and return a new RDD of the exact same type with x elements.


Solution

  • rdd.take(n) actually moves n elements to the driver and returns them as an array (see the ScalaDoc). If you want to print them:

    println("3" : + cassRdd2.take(5).toList)
    

    or use cassRdd2.take(5).foreach(println). The last line does not work because, for arrays, the method is called length (or size):

    println("4" : + cassRdd2.take(5).length)