I am doing some exercises on the DataStax VM.
I am given a Cassandra table and am supposed to do some filtering and then retrieve the top 5 elements, using Spark API functions rather than Cassandra query functions.
This is what I am doing:
val cassRdd = sc.cassandraTable("killr_video", "videos_by_year_title")
val cassRdd2 = cassRdd.filter(r => r.getString("title") >= "T")
println("1: " + cassRdd2)
println("2: " + cassRdd2.count)
println("3: " + cassRdd2.take(5))
println("4: " + cassRdd2.take(5).count)
Results in:
What I expected:
The solution given by DataStax uses the RDD and does a map transformation on it to take only the title; on that new title RDD it does the filtering and the take command:

val cassRdd2 = cassRdd.map(r => r.getString("title")).filter(t => t >= "T")

OK, that works, but I don't understand why take should not work on an RDD of CassandraRow, or what the result of that would be. I thought the take command on any RDD (regardless of its contents) would always do the same thing: take the first x elements, resulting in a new RDD of the exact same type with x elements.
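For what it's worth, the same thing happens with a plain RDD of strings, so it does not seem to be specific to CassandraRow (a minimal reproduction of my own, using the same sc):

val plainRdd = sc.parallelize(Seq("Titanic", "Toy Story", "Avatar"))
val filtered = plainRdd.filter(t => t >= "T")

println("count: " + filtered.count)   // fine
println("take: " + filtered.take(5))  // compiles, but does not print the elements
// println("4: " + filtered.take(5).count)  // fails, just like with the Cassandra RDD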
rdd.take(n) actually moves n elements to the driver and returns them as an array, see the ScalaDoc. If you want to print them:
println("3" : + cassRdd2.take(5).toList)
or cassRdd2.take(5).foreach(println). The last line does not work because for arrays the method is called length (or size):
println("4" : + cassRdd2.take(5).length)