Mismatched Array type (scala.Array vs Array) for an RDD[Array[String]]


I'm trying to convert a DataFrame to an RDD[Array[String]] in Spark. Currently I use the following approach:

case class Array[String](c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer)

val newData = df.distinct.map {
  case Row(c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer) => Array[String](c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer)
}

val newRDD = newData.rdd

This gives me what appears to be a transformation from a DataFrame to an RDD[Array[String]]. However, when I wrap it in a function, like so:

 def caseNewRDD(df: DataFrame): RDD[Array[String]] ={
    case class Array[String](c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer)
    val newData = df.distinct.map {
      case org.apache.spark.sql.Row(c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer) => Array[String](c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer)
    }
    val newRDD = newData.rdd
    newRDD
  }

I get the following error:

Expression of type org.apache.spark.rdd.RDD[Array[scala.Predef.String]] doesn't conform to expected type org.apache.spark.rdd.RDD[scala.Array[scala.Predef.String]]

I'm guessing the Array type that I'm generating isn't conforming properly, but I cannot figure out why.

Any help would be appreciated.


Solution

  • You cannot cast types in Scala like that.

    case class Array[String](c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer)
    

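    does not convert anything. It declares a NEW class named Array with a type parameter named String, and inside that scope those names shadow scala.Array and scala.Predef.String. That is why the compiler reports Array[...] and scala.Array[...] as different types. What you are trying to accomplish is: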

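    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row}

    // Assumes an implicit Encoder[Array[String]] is in scope,
    // e.g. via `import spark.implicits._` on a SparkSession named `spark`.
    def caseNewRDD(df: DataFrame): RDD[Array[String]] = {
      df.distinct.map {
        case Row(c0:Long, c1:Integer, c2:Long, c3:String, c4:Integer, c5:Integer, c6:Integer) => 
          // Build a standard scala.Array[String] by converting each field explicitly.
          Array(c0.toString, c1.toString, c2.toString, c3, c4.toString, c5.toString, c6.toString)
      }.rdd
    }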
    

    That is, I explicitly convert each field to a String and build a standard scala.Array[String], without declaring any new type.
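
    To see the shadowing in isolation, here is a minimal sketch that does not need Spark. The names Shadowed, ShadowDemo, isRealArray and toStrings are hypothetical and used only for illustration, and the case class is cut down to two fields. It declares a class with the same shape as the one in the question and checks whether its instances are genuine scala.Array values.

```scala
object Shadowed {
  // Same shape as the declaration in the question: a new class called Array
  // whose type parameter is called String (it is NOT scala.Predef.String here).
  case class Array[String](c0: Long, c3: String)

  // Inside this object, `Array` resolves to the case class above,
  // so this builds an instance of Shadowed.Array, not of scala.Array.
  val value = Array[java.lang.String](1L, "a")
}

object ShadowDemo {
  // True only for genuine scala.Array instances.
  def isRealArray(x: Any): Boolean = x.isInstanceOf[scala.Array[_]]

  // Explicit per-field conversion, as in the answer: yields a scala.Array[String].
  def toStrings(c0: Long, c3: String): Array[String] = Array(c0.toString, c3)

  def main(args: Array[String]): Unit = {
    println(isRealArray(Shadowed.value))      // false
    println(isRealArray(toStrings(1L, "a")))  // true
  }
}
```

    Giving the case class a name that does not collide with a standard type (for example a hypothetical Record) avoids the confusion entirely, but then the result is an RDD[Record] rather than an RDD[Array[String]].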