I'm trying to convert a DataFrame to an RDD[Array[String]] in Spark, and currently I use the following method to do so:
case class Array[String](c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer)
val newData = df.distinct.map {
  case Row(c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer) =>
    Array[String](c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer)
}
val newRDD = newData.rdd
This gives me what appears to be a working transformation from a DataFrame to an RDD[Array[String]]. However, when I wrap it up in a function, like so:
def caseNewRDD(df: DataFrame): RDD[Array[String]] = {
  case class Array[String](c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer)
  val newData = df.distinct.map {
    case org.apache.spark.sql.Row(c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer) =>
      Array[String](c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer)
  }
  val newRDD = newData.rdd
  newRDD
}
I get the following error:
Expression of type org.apache.spark.rdd.RDD[Array[scala.Predef.String]] doesn't conform to expected type org.apache.spark.rdd.RDD[scala.Array[scala.Predef.String]]
I'm guessing the Array type that I'm generating isn't conforming properly, but I cannot figure out why.
Any help would be appreciated.
You cannot cast types in Scala like that.
case class Array[String](c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer)
means: create a NEW type named Array with a type parameter named String. Neither of those has anything to do with scala.Array or scala.Predef.String; the declaration simply shadows both names in its scope, which is exactly why the compiler reports that your Array[String] does not conform to scala.Array[scala.Predef.String]. What you are trying to accomplish is:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

def caseNewRDD(df: DataFrame): RDD[Array[String]] = {
  // In Spark 2.x, Dataset.map needs an Encoder[Array[String]];
  // the session's implicits provide one.
  import df.sparkSession.implicits._
  df.distinct.map {
    case Row(c0: Long, c1: Integer, c2: Long, c3: String, c4: Integer, c5: Integer, c6: Integer) =>
      Array(c0.toString, c1.toString, c2.toString, c3, c4.toString, c5.toString, c6.toString)
  }.rdd
}
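For example (a minimal usage sketch; the session name spark and the sample data here are made up for illustration, not taken from your question), the fixed function behaves like this:
// Assumes a SparkSession named spark is in scope.
// Demo data with the same column types as your rows:
// (Long, Integer, Long, String, Integer, Integer, Integer).
import spark.implicits._
val df = Seq(
  (1L, 2, 3L, "a", 4, 5, 6),
  (7L, 8, 9L, "b", 10, 11, 12)
).toDF("c0", "c1", "c2", "c3", "c4", "c5", "c6")

caseNewRDD(df).collect().foreach(arr => println(arr.mkString("|")))
// Prints (row order is not guaranteed after distinct):
// 1|2|3|a|4|5|6
// 7|8|9|b|10|11|12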
That is, in caseNewRDD above I explicitly convert each value to a String without actually creating a new type.
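To make the shadowing concrete, here is a small self-contained sketch (plain Scala, no Spark needed; the object name is made up for this demo):
object ShadowingDemo {
  // This creates a brand-new case class named Array with a type parameter
  // that happens to be called String. It is unrelated to scala.Array and
  // scala.Predef.String; it merely shadows those names in this scope.
  case class Array[String](value: String)

  def main(args: scala.Array[scala.Predef.String]): Unit = {
    val notAnArray = Array("x")  // builds the case class, not a scala.Array
    println(notAnArray)          // prints Array(x), a case class instance

    // The real array type now has to be written fully qualified:
    val realArray: scala.Array[scala.Predef.String] = scala.Array("x")
    println(realArray.mkString(","))
  }
}
This is exactly why the compiler saw two different Array types inside your function. As a side note, calling .rdd before .map would also work and would sidestep the Encoder that Dataset.map requires, since the pattern match would then run on a plain RDD[Row].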