scala, apache-spark, cassandra, spark-cassandra-connector

RDD of CassandraRow not working with take-command - why?


I am doing some exercises on the DataStax VM.

A Cassandra table is given, and I am supposed to do some filtering and retrieve the top 5 elements using Spark API functions rather than Cassandra query functions.

There I am doing the following:

val cassRdd = sc.cassandraTable("killr_video", "videos_by_year_title")
val cassRdd2 = cassRdd.filter(r => r.getString("title") >= "T")
println("1: " + cassRdd2)
println("2: " + cassRdd2.count)
println("3: " + cassRdd2.take(5))
println("4: " + cassRdd2.take(5).count)

Results in:

  • 1: MapPartitionsRDD[185] at filter at <console>:19
  • 2: 2250
  • 3: [Lcom.datastax.spark.connector.CassandraRow;@56fd2e09
  • 4: compile error (missing arguments for method count in trait TraversableOnce)

What I have expected:

  • 1: and 2: work as expected
  • 3: returns only one row? I would expect an RDD of 5 CassandraRows
  • 4: this isn't the RDD count from 3:, so I didn't expect it to work; it looks like some kind of CassandraRow count method I was not supposed to call

The solution given by DataStax maps the RDD to just the title, and on that new title RDD it does the filtering and the take command.

OK, that works, but I don't understand why take does not work on an RDD of CassandraRow, or what its result actually is.

val cassRdd2 = cassRdd.map(r => r.getString("title")).filter(t => t >= "T")
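
For completeness, the full chain would then presumably end with take; a minimal sketch of that last step (my reconstruction, building on the cassRdd from above, not the exact DataStax code):

val titleRdd = cassRdd.map(r => r.getString("title")).filter(t => t >= "T")
// take(5) returns an Array[String] on the driver, not an RDD
titleRdd.take(5).foreach(println)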

I thought take on any RDD (regardless of its contents) would always do the same thing: take the first x elements and return a new RDD of the exact same type with x elements.


Solution

  • rdd.take(n) actually moves n elements to the driver and returns them as an array (see the ScalaDoc). If you want to print them:

    println("3" : + cassRdd2.take(5).toList)
    

    or use cassRdd2.take(5).foreach(println). The last line does not work because, for arrays, the method is called length (or size):

    println("4" : + cassRdd2.take(5).length)