Tags: apache-spark, apache-spark-sql, cassandra, spark-streaming, spark-cassandra-connector

How to convert RDD[CassandraRow] to DataFrame?


Currently, this is how I am transforming a CassandraRow RDD to a DataFrame:

val ssc = new StreamingContext(sc, Seconds(15))

val dstream = new ConstantInputDStream(ssc, ssc.cassandraTable("db", "table").select("createdon"))

import sqlContext.implicits._

dstream.foreachRDD { rdd =>
    // Render each row as a string, then slice the value out of it
    val dataframeJobs = rdd
      .map(_.dataAsString)
      .map(_.split(":")(1))
      .map(_.split(" ")(1))
      .toDF("ondate")
}

As you can see, I first convert the CassandraRow RDD to a string and then map it into the format I want. This method gets complicated when the RDD contains multiple columns instead of just one (createdon), as in the example.

Is there an alternative, simpler way to convert a CassandraRow RDD to a DataFrame?

My build.sbt is as follows:

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.1",
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.2",
  "org.apache.spark" %% "spark-streaming" % "2.0.2"
)

Solution

  • I figured out an alternative way that works effectively with any number of columns:

    rdd.keyBy(row => (row.getString("createdon"))).map(x => x._1).toDF("ondate")
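    Since keyBy builds a key only to discard the row with map(_._1), a plain map over the typed getters does the same job, and the pattern extends naturally to several columns by collecting them into a tuple before calling toDF. This is a sketch under the same setup as the question; the second column name createdby is a hypothetical example, not from the original table:

    ```scala
    import com.datastax.spark.connector._
    import sqlContext.implicits._

    dstream.foreachRDD { rdd =>
      // Single column: read the value directly instead of keyBy + _1
      val single = rdd
        .map(row => row.getString("createdon"))
        .toDF("ondate")

      // Multiple columns: build a tuple per row, one getter per column.
      // "createdby" is a hypothetical second column for illustration.
      val multi = rdd
        .map(row => (row.getString("createdon"), row.getString("createdby")))
        .toDF("ondate", "createdby")
    }
    ```

    Using the typed getters (getString, getInt, getLong, ...) also avoids the fragile string-splitting of dataAsString, since each column is extracted by name rather than by position in the rendered text.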